PROBABILITY AND STATISTICS


Foreword 


The Indian Statistical Institute (ISI) was established on 17th December, 
1931 by a great visionary Prof. Prasanta Chandra Mahalanobis to promote 
research in the theory and applications of statistics as a new scientific disci- 
pline in India. In 1959, Pandit Jawaharlal Nehru, the then Prime Minister 
of India introduced the ISI Act in the parliament and designated it as an 
Institution of National Importance because of its remarkable achievements 
in statistical work as well as its contribution to economic planning. 

Today, the Indian Statistical Institute occupies a prestigious position 
in the academic firmament. It has been a haven for bright and talented 
academics working in a number of disciplines. Its research faculty has done 
India proud in the arenas of Statistics, Mathematics, Economics and
Computer Science, among others. Over seventy-five years, it has grown into
a massive banyan tree, like the Institute's emblem. The Institute now
serves the nation as a unified and monolithic organization from different
places: Kolkata, the headquarters; three centers at Delhi, Bangalore and
Chennai; a network of five SQC-OR Units located at Mumbai, Pune, Baroda,
Hyderabad and Coimbatore; and a branch (field station) at Giridih.

The platinum jubilee celebrations of ISI were launched by the Honorable
Prime Minister Prof. Manmohan Singh on December 24, 2006, and
the Government of India has declared 29th June as “Statistics Day” to
commemorate the birthday of Prof. Mahalanobis nationally.

Professor Mahalanobis was a great believer in interdisciplinary research,
because he thought that this would promote the development of not only
Statistics, but also the other natural and social sciences. To promote in- 
terdisciplinary research, major strides were made in the areas of computer 
science, statistical quality control, economics, biological and social sciences, 
physical and earth sciences. 

The Institute’s motto of “unity in diversity” has been the guiding prin- 
ciple of all its activities since its inception. It highlights the unifying role 
of statistics in relation to various scientific activities. 

In tune with this hallowed tradition, a comprehensive academic programme,
involving Nobel Laureates, Fellows of the Royal Society, an Abel Prize
winner and other dignitaries, has been implemented throughout the
Platinum Jubilee year, highlighting the emerging areas of ongoing frontline 
research in its various scientific divisions, centers, and outlying units. It 
includes international and national-level seminars, symposia, conferences 
and workshops, as well as series of special lectures. As an outcome of these 
events, the Institute is bringing out a series of comprehensive volumes in 
different subjects under the title Statistical Science and Interdisciplinary 
Research, published by the World Scientific Press, Singapore. 

The present volume titled “Perspectives in Mathematical Sciences I: 
Probability and Statistics” is the seventh one in the series. The volume 
consists of eleven chapters, written by eminent probabilists and statisti- 
cians from different parts of the world. These chapters provide a current 
perspective of different areas of research, emphasizing the major challeng- 
ing issues. They deal mainly with statistical inference, both frequentist 
and Bayesian, with applications of the methodology that will be of use to 
practitioners. I believe the state-of-the-art studies presented in this book
will be very useful to both researchers and practitioners.

Thanks to the contributors for their excellent research contributions,
and to the volume editors Profs. N. S. Narasimha Sastry, T. S. S. R. K. Rao,
M. Delampady and B. Rajeev for their sincere effort in bringing out the
volume in good time. The initial design of the cover by Mr. Indranil Dutta
is acknowledged. Sincere efforts by Prof. Dilip Saha and Dr. Barun
Mukhopadhyay for editorial assistance are appreciated. Thanks are also due
to World Scientific for their initiative in publishing the series and being
a part of the Platinum Jubilee endeavor of the Institute.


December 2008 Sankar K. Pal 
Kolkata Series Editor and 
Director 


Preface 


Indian Statistical Institute, a premier research institute founded by Pro- 
fessor Prasanta Chandra Mahalanobis in Calcutta in 1931, celebrated its 
platinum jubilee during the year 2006-07. On this occasion, the institute 
organized several conferences and symposia in various scientific disciplines 
in which the institute has been active. 

From the beginning, research and training in probability, statistics and 
related mathematical areas including mathematical computing have been 
some of the main activities of the institute. Over the years, the contribu- 
tions from the scientists of the institute have had a major impact on these 
areas. 

As a part of these celebrations, the Division of Theoretical Statistics and 
Mathematics of the institute decided to invite distinguished mathematical 
scientists to contribute articles, giving “a perspective of their discipline, 
emphasizing the current major issues”. A conference entitled “Perspectives 
in Mathematical Sciences” was also organized at the Bangalore Centre of 
the institute during February 4-8, 2008. 

The articles submitted by the speakers at the conference, along with 
the invited articles, are brought together here in two volumes (Part A and 
Part B). 

Part A consists of articles in Probability and Statistics. Articles in 
Statistics are mainly on statistical inference, both frequentist and Bayesian, 
for problems of current interest. These articles also contain applications 
illustrating the methodologies discussed. The articles on probability are 
based on different “probability models” arising in various contexts (ma- 
chine learning, quantum probability, probability measures on Lie groups, 
economic phenomena modelled on iterated random systems, “measure free 
martingales”, and interacting particle systems) and represent active areas 
of research in probability and related fields. 

Part B consists of articles in Algebraic Geometry, Algebraic Number
Theory, Functional Analysis and Operator Theory, Scattering Theory,
von Neumann Algebras, Discrete Mathematics, Permutation Groups, Lie
Theory and Super Symmetry.

All the authors have taken care to make their exposition fairly self- 
contained. It is our hope that these articles will be valuable to researchers 
at various levels. 

The editorial committee thanks all the authors for writing the articles 
and sending them in time, the speakers at the conference for their talks and 
various scientists who have kindly refereed these articles. Thanks are also 
due to the National Board for Higher Mathematics, India, for providing 
partial support to the conference. Finally, we thank Ms. Asha Lata for her 
help in compiling these volumes. 


October 16, 2008 N. S. Narasimha Sastry
T. S. S. R. K. Rao
Mohan Delampady
B. Rajeev

1.1. Introduction


This article discusses the concepts of relative entropy of a probability
measure with respect to a dominating measure and that of measure free
martingales. There is considerable literature on the concepts of relative
entropy and standard martingales, both separately and on the connection
between the two. This paper draws from results established in [1]
(unpublished notes) and [6]. In [1] the concept of relative entropy and its
maximization subject to a finite as well as infinite number of linear
constraints is discussed. In [6] the notion of measure free martingale of a
sequence $\{f_n\}_{n=1}^{\infty}$ of real valued functions, with the
restriction that each $f_n$ takes only finitely many distinct values, is
introduced. Here is an outline of the paper.

In section 1.2 the concepts of relative entropy and Gibbs-Boltzmann
measures, and a few results on the maximization of relative entropy and
the weak convergence of the Gibbs-Boltzmann measures are presented.
We also settle in the negative a problem posed in [6]. In section 1.3 the
notion of measure free martingale is generalized from the case of a finitely
many valued sequence $\{f_n\}_{n=1}^{\infty}$ to the general case where each
$f_n$ is allowed to be a Borel function taking possibly uncountably many
values. It is shown that every martingale is a measure free martingale, and
conversely that every measure free martingale admits a finitely additive
measure on a certain algebra under which it is a martingale. Conditions
under which such a measure is countably additive are given. The last section
is devoted to an ab initio discussion of the existence of an equivalent
martingale measure and the uniqueness of such a measure if it is chosen to
maximize certain relative entropies.


1.2. Relative Entropy and Gibbs-Boltzmann Measures 


1.2.1. Entropy Maximization Results


Let $(\Omega, \mathcal{B}, \mu)$ be a measure space. A $\mathcal{B}$
measurable function $f : \Omega \to [0, \infty)$ is called a probability
density function (p.d.f.) with respect to $\mu$ if
$\int_\Omega f(\omega)\,\mu(d\omega) = 1$. Then
$\nu_f(A) = \int_A f(\omega)\,\mu(d\omega)$, $A \in \mathcal{B}$, is a
probability measure dominated by $\mu$. The relative entropy of $\nu_f$ with
respect to $\mu$ is defined as

$$H(f, \mu) = -\int_\Omega f(\omega) \log f(\omega)\,\mu(d\omega) \tag{1}$$

provided the integral on the right hand side is well defined, although it
may possibly be infinite. In particular, if $\mu$ is a finite measure, this
holds since the function $h(x) = x \log x$ is bounded on $(0,1)$ and hence
$\int_\Omega (-f(\omega) \log f(\omega))^{+}\,\mu(d\omega) < \infty$. This
does allow for the possibility that $H(f, \mu)$ could be $-\infty$ when
$\mu(\Omega)$ is finite. We will show below that if $\mu(\Omega)$ is finite
and positive then $H(f, \mu) \le \log \mu(\Omega)$ for all p.d.f. $f$ with
respect to $\mu$. In particular if $\mu(\Omega) = 1$, $H(f, \mu) \le 0$.
We recall here for the benefit of the reader that a $\mathcal{B}$ measurable
non-negative real valued function $f$ always has a well defined integral
with respect to $\mu$. It is denoted by $\int_\Omega f(\omega)\,\mu(d\omega)$.
The integral may be finite or infinite. A real valued $\mathcal{B}$
measurable function $f$ can be written as $f = f_+ - f_-$, where, for each
$\omega \in \Omega$,

$$f_+(\omega) = \max\{0, f(\omega)\}, \qquad f_-(\omega) = -\min\{0, f(\omega)\}.$$

If at least one of $f_+$, $f_-$ has a finite integral, then we say that $f$
has a well defined integral with respect to $\mu$ and write

$$\int_\Omega f(\omega)\,\mu(d\omega) = \int_\Omega f_+(\omega)\,\mu(d\omega) - \int_\Omega f_-(\omega)\,\mu(d\omega).$$

Now note a simple fact from calculus. The function
$\varphi(x) = x - 1 - \log x$ on $(0, \infty)$ has a unique minimum at
$x = 1$ and $\varphi(1) = 0$. Thus for all $x > 0$, $\log x \le x - 1$ with
equality holding if and only if $x = 1$. So if $f_1$ and $f_2$ are two
probability density functions on $(\Omega, \mathcal{B}, \mu)$, then for all
$\omega$,

$$f_1(\omega) \log f_2(\omega) - f_1(\omega) \log f_1(\omega) \le f_2(\omega) - f_1(\omega), \tag{2}$$

with equality holding if and only if $f_1(\omega) = f_2(\omega)$. Assume now
that $f_1(\omega) \log f_1(\omega)$, $f_1(\omega) \log f_2(\omega)$ have
definite integrals with respect to $\mu$ and that one of them is finite. On
integrating the two sides of (2) we get

$$\int_\Omega f_1(\omega) \log f_2(\omega)\,\mu(d\omega) - \int_\Omega f_1(\omega) \log f_1(\omega)\,\mu(d\omega) \le \int_\Omega f_2(\omega)\,\mu(d\omega) - \int_\Omega f_1(\omega)\,\mu(d\omega) = 1 - 1 = 0.$$

The middle inequality becomes an equality if and only if equality holds
in (2) a.e. with respect to $\mu$. We have proved:


Proposition 2.1. Let $(\Omega, \mathcal{B}, \mu)$ be a measure space and let
$f_1, f_2$ be two probability density functions on
$(\Omega, \mathcal{B}, \mu)$. Assume that the functions $f_1 \log f_1$,
$f_1 \log f_2$ have definite integrals with respect to $\mu$ and that one of
them is finite. Then

$$H(f_1, \mu) = -\int_\Omega f_1(\omega) \log f_1(\omega)\,\mu(d\omega) \le -\int_\Omega f_1(\omega) \log f_2(\omega)\,\mu(d\omega), \tag{3}$$

with equality holding if and only if $f_1(\omega) = f_2(\omega)$, a.e. $\mu$.


Note that if $\mu(\Omega)$ is finite and positive and if we set
$f_2(\omega) = (\mu(\Omega))^{-1}$ for all $\omega$, then the right hand side
of (3) becomes $\log \mu(\Omega)$ and we conclude that the relative entropy
$H(f_1, \mu)$ is well defined and at most $\log \mu(\Omega)$.
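
The inequality of Proposition 2.1 can be checked numerically on a finite measure space. The following is only a sketch: the four-point space, the weights `mu` and the density `f1` are made-up illustrative data, and `entropy` and `cross_term` are hypothetical helper names, not notation from the text.

```python
import math

def entropy(f, mu):
    """H(f, mu) = -sum_w f(w) log f(w) mu(w), with the convention 0 log 0 = 0."""
    return -sum(fw * math.log(fw) * mw for fw, mw in zip(f, mu) if fw > 0)

def cross_term(f1, f2, mu):
    """-sum_w f1(w) log f2(w) mu(w); assumes f2 > 0 wherever f1 > 0."""
    return -sum(a * math.log(b) * mw for a, b, mw in zip(f1, f2, mu) if a > 0)

# Illustrative finite measure on a four-point space (total mass 4).
mu = [0.5, 1.0, 2.0, 0.5]
total = sum(mu)

# Normalise an arbitrary non-negative function into a p.d.f. with respect to mu.
raw = [0.4, 0.3, 0.1, 0.2]
z = sum(a * m for a, m in zip(raw, mu))
f1 = [a / z for a in raw]

# The constant density 1 / mu(Omega).
f2 = [1.0 / total] * len(mu)

h1 = entropy(f1, mu)
bound = cross_term(f1, f2, mu)   # equals log mu(Omega)
print(h1, bound, math.log(total))
```

With the constant density as `f2`, the right hand side of the inequality is exactly the logarithm of the total mass, and the constant density itself attains it.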

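Let $f_0$ be a probability density function on $(\Omega, \mathcal{B}, \mu)$
such that $\lambda = H(f_0, \mu)$ is finite and let

$$\mathcal{F}_\lambda = \left\{ f : f \text{ a p.d.f. w.r.t. } \mu \text{ and } -\int_\Omega f(\omega) \log f_0(\omega)\,\mu(d\omega) = \lambda \right\}. \tag{4}$$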

From Proposition 2.1 it follows that for any $f \in \mathcal{F}_\lambda$,

$$H(f, \mu) = -\int_\Omega f(\omega) \log f(\omega)\,\mu(d\omega) \le -\int_\Omega f(\omega) \log f_0(\omega)\,\mu(d\omega) = \lambda = -\int_\Omega f_0(\omega) \log f_0(\omega)\,\mu(d\omega),$$

with equality holding if and only if $f = f_0$, a.e. $\mu$. We summarize this as:


Theorem 2.1. Let $f_0$, $\lambda$, $\mathcal{F}_\lambda$ be as in (4) above. Then

$$\sup\{H(f, \mu) : f \in \mathcal{F}_\lambda\} = H(f_0, \mu)$$

and $f_0$ is the unique maximiser.


Theorem 2.1 says that any probability density function $f_0$ with respect
to $\mu$ with finite entropy relative to $\mu$ appears as the unique solution
to an entropy maximizing problem in an appropriate class of probability
density functions. Of course, this assertion has some meaning only if
$\mathcal{F}_\lambda$ does not consist of $f_0$ alone. A useful reformulation
of this result is as follows:


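Theorem 2.2. Let $h : \Omega \to \mathbb{R}$ be a $\mathcal{B}$ measurable
function. Let $c$ and $\lambda$ be real numbers such that

$$\psi(c) = \int_\Omega e^{c h(\omega)}\,\mu(d\omega) < \infty, \qquad \int_\Omega |h(\omega)|\, e^{c h(\omega)}\,\mu(d\omega) < \infty, \qquad \text{and} \qquad \lambda = \frac{\int_\Omega h(\omega) e^{c h(\omega)}\,\mu(d\omega)}{\psi(c)}.$$

Let $f_0 = \frac{e^{ch}}{\psi(c)}$ and let

$$\mathcal{F}_\lambda = \left\{ f : f \text{ a p.d.f. w.r.t. } \mu \text{ and } \int_\Omega f(\omega) h(\omega)\,\mu(d\omega) = \lambda \right\}.$$

Then

$$\sup\{H(f, \mu) : f \in \mathcal{F}_\lambda\} = H(f_0, \mu),$$

and $f_0$ is the unique maximiser.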


Here are some simple examples of the above considerations.

Example 1. Let $\Omega = \{1, 2, \cdots, N\}$, $N < \infty$, $\mu$ the
counting measure on $\Omega$, $h \equiv 1$, $\lambda = 1$. Then
$\mathcal{F}_\lambda = \{(p_i)_{i=1}^{N} : p_i \ge 0, \sum_{i=1}^{N} p_i = 1\}$.
Then

$$f_0(j) = \frac{1}{N}, \quad j = 1, 2, \cdots, N,$$

the uniform distribution on $\{1, 2, \cdots, N\}$, maximizes the relative
entropy of the class $\mathcal{F}_\lambda$ with respect to $\mu$.


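Example 2. Let $\Omega = \mathbb{N}$, the natural numbers $\{1, 2, \cdots\}$,
$\mu$ = the counting measure on $\Omega$, $h(j) = j$, $j \in \mathbb{N}$. Fix
$\lambda$, $1 < \lambda < \infty$ and let

$$\mathcal{F}_\lambda = \left\{ (p_j)_{j=1}^{\infty} : p_j \ge 0, \ \sum_{j=1}^{\infty} p_j = 1, \ \sum_{j=1}^{\infty} j p_j = \lambda \right\}.$$

Then $f_0(j) = (1 - p) p^{j-1}$, $j = 1, 2, \cdots$, where
$p = 1 - \frac{1}{\lambda}$, maximizes the relative entropy of the class
$\mathcal{F}_\lambda$ with respect to $\mu$.

Example 3. Let $\Omega = \mathbb{R}$, $\mu$ = Lebesgue measure on
$\mathbb{R}$, $h(x) = x^2$, $0 < \lambda < \infty$. Set
$\mathcal{F}_\lambda = \{ f : f \ge 0, \int_{\mathbb{R}} f(x)\,dx = 1, \int_{\mathbb{R}} x^2 f(x)\,dx = \lambda \}$.
Then

$$f_0(x) = \frac{1}{\sqrt{2\pi\lambda}}\, e^{-\frac{x^2}{2\lambda}},$$

i.e., the Gaussian distribution with mean zero and variance $\lambda$,
maximizes the relative entropy of the class $\mathcal{F}_\lambda$ with
respect to the Lebesgue measure.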

These examples are well known (see [5]) and the usual method is by the
use of Lagrange's multipliers. The present method extends to the case of
an arbitrary number of constraints (see [1], [8]).
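
The geometric example can be checked numerically. The sketch below compares the entropy of the geometric distribution with mean 3 against two other distributions on the natural numbers with the same mean. The competitors `uniform5` and `two_point`, the truncation point `J` and the helper names are illustrative choices, not part of the text.

```python
import math

def entropy(p):
    """Entropy -sum_j p_j log p_j relative to counting measure (0 log 0 = 0)."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def mean(p):
    """Mean of a distribution on {1, 2, ...} stored as p[0] = p_1, p[1] = p_2, ..."""
    return sum(j * pj for j, pj in enumerate(p, start=1))

lam = 3.0                 # prescribed mean
p = 1.0 - 1.0 / lam       # parameter of the maximising geometric distribution
J = 400                   # truncation point; the neglected tail mass p**J is negligible

geometric = [(1.0 - p) * p ** (j - 1) for j in range(1, J + 1)]
uniform5 = [0.2] * 5                     # uniform on {1,...,5}, mean 3
two_point = [0.5, 0.0, 0.0, 0.0, 0.5]    # mass 1/2 at 1 and at 5, mean 3

for name, dist in [("geometric", geometric), ("uniform5", uniform5), ("two_point", two_point)]:
    print(name, round(mean(dist), 6), round(entropy(dist), 6))
```

All three distributions satisfy the mean constraint, and the geometric one has the largest entropy of the three, as the example asserts for the whole class.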


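Definition 2.1. Let $(\Omega, \mathcal{B}, \mu)$ be a measure space. Let
$h : \Omega \to \mathbb{R}$ be $\mathcal{B}$ measurable and let $c$ be a real
number. Let $\psi(c) = \int_\Omega e^{c h(\omega)}\,\mu(d\omega)$ be finite.
Let

$$\nu_{(\mu, c, h)}(A) = \frac{\int_A e^{c h(\omega)}\,\mu(d\omega)}{\psi(c)}, \quad A \in \mathcal{B}.$$

The probability measure $\nu_{(\mu, c, h)}$ is called the Gibbs-Boltzmann
measure corresponding to $(\mu, c, h)$.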


Example 4. (Spin system on $N$ states.) Let $\Omega = \{-1, 1\}^N$, $N$ a
positive integer, and let $V : \Omega \to \mathbb{R}$, $0 < \beta_T < \infty$
be given. Let $\mu$ denote the counting measure on $\Omega$. The measure

$$\nu_{(\mu, \beta_T, V)}(A) = \frac{\sum_{\omega \in A} e^{-\beta_T V(\omega)}}{\sum_{\omega \in \Omega} e^{-\beta_T V(\omega)}}, \quad A \subset \Omega, \tag{5}$$

is called the Gibbs distribution with potential function $V$ and temperature
constant $\beta_T$ for the spin system of $N$ states. The denominator on the
right side of (5) is known as the partition function.
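
On a finite space the Gibbs-Boltzmann measure of Definition 2.1 is a normalised exponential weighting, and the partition function is the normalising constant. The sketch below computes it for a three-spin system. The nearest-neighbour potential `V`, the value of `beta` and the function name `gibbs_boltzmann` are assumptions made for illustration; the call uses the usual physics sign convention, that is, it applies Definition 2.1 with $c = -\beta_T$ and $h = V$.

```python
import itertools
import math

def gibbs_boltzmann(points, mu, c, h):
    """Return (probabilities, psi(c)) for the measure with weights exp(c*h(w))*mu(w)."""
    weights = [math.exp(c * h(w)) * m for w, m in zip(points, mu)]
    psi = sum(weights)                      # normalising constant psi(c)
    return [wt / psi for wt in weights], psi

N = 3
omega = list(itertools.product([-1, 1], repeat=N))   # all 2**N spin configurations
counting = [1.0] * len(omega)                        # counting measure

def V(w):
    """Hypothetical potential: aligned neighbouring spins lower the energy."""
    return -sum(w[i] * w[i + 1] for i in range(len(w) - 1))

beta = 0.7   # illustrative temperature constant
probs, partition = gibbs_boltzmann(omega, counting, -beta, V)

for w, pr in zip(omega, probs):
    print(w, round(pr, 4))
print("partition function:", round(partition, 4))
```

The two fully aligned configurations have the lowest potential and therefore receive the largest, equal, probabilities.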


1.2.2. Weak Convergence of Gibbs-Boltzmann Distribution 


Let $\Omega$ be a Polish space, i.e., a complete separable metric space and
let $\mathcal{B}$ denote its Borel $\sigma$-algebra. Recall that a sequence
$(\mu_n)_{n=1}^{\infty}$ of probability measures on $(\Omega, \mathcal{B})$
converges weakly to a probability measure $\mu$ on $(\Omega, \mathcal{B})$ if

$$\int_\Omega f(\omega)\,\mu_n(d\omega) \to \int_\Omega f(\omega)\,\mu(d\omega)$$

for every continuous bounded function $f : \Omega \to \mathbb{R}$. Now let
$(\mu_n)_{n=1}^{\infty}$ be a sequence of probability measures on
$(\Omega, \mathcal{B})$, $(h_n)_{n=1}^{\infty}$ a sequence of $\mathcal{B}$
measurable functions from $\Omega$ to $\mathbb{R}$ and
$(c_n)_{n=1}^{\infty}$ a sequence of real numbers. Assume that for each
$n \ge 1$, $\int_\Omega e^{c_n h_n(\omega)}\,\mu_n(d\omega) < \infty$. For
each $n \ge 1$, let $\nu_{(\mu_n, c_n, h_n)}$ be the Gibbs-Boltzmann measure
corresponding to $(\mu_n, c_n, h_n)$ as in Definition 2.1. An important
problem is to find conditions on $(\mu_n, h_n, c_n)_{n=1}^{\infty}$ so that
$(\nu_{(\mu_n, c_n, h_n)})_{n=1}^{\infty}$ converges weakly. We address this
question in a somewhat special context. We start with some preliminaries.

Let $C \subset \mathbb{R}$ be a compact subset and $\mu$ a probability
measure on Borel subsets of $\mathbb{R}$ with support contained in $C$. For
$c \in \mathbb{R}$, let

$$\psi(c) = \int_C e^{cx}\,\mu(dx).$$

Since $C$ is bounded and $\mu$ is a probability measure, the function $\psi$
is well defined and infinitely differentiable on $\mathbb{R}$. For any
$k \ge 1$,

$$\psi^{(k)}(c) = \int_C e^{cx} x^k\,\mu(dx).$$

Note that the function $f_c(x) = \frac{e^{cx}}{\psi(c)}$ is a probability
density function with respect to $\mu$ with mean
$\phi(c) = \frac{\psi'(c)}{\psi(c)}$.


Proposition 2.2.

(i) $\phi$ is infinitely differentiable and $\phi'(c) > 0$ for all
$c \in \mathbb{R}$, provided $\mu$ is not supported on a single point. If
$\mu$ is a Dirac measure, i.e., if $\mu$ is supported on a single point, then
$\phi'(c) = 0$ for all $c$.
(ii) $\lim_{c \to -\infty} \phi(c) = \inf\{x : \mu((-\infty, x)) > 0\} = a$,
(iii) $\lim_{c \to +\infty} \phi(c) = \sup\{x : \mu([x, \infty)) > 0\} = b$,
(iv) for any $\alpha$, $a < \alpha < b$, there is a unique $c$ such that
$\phi(c) = \alpha$.


Proof: If $\mu$ is a Dirac measure then the claims are trivially true, so we
assume that $\mu$ is not a Dirac measure. Since $\psi$ is infinitely
differentiable and $\psi(c) > 0$ for all $c$, $\phi$ is also infinitely
differentiable. Moreover,

$$\phi'(c) = \frac{\left(\int_C x^2 e^{cx}\,\mu(dx)\right) \psi(c) - (\psi'(c))^2}{(\psi(c))^2}$$

can be seen as the variance of a non-constant random variable $X_c$, whose
distribution is absolutely continuous with respect to $\mu$ with probability
density function $f_c(x) = \frac{e^{cx}}{\psi(c)}$. (Note that $X_c$ is
non-constant since $\mu$ is not concentrated at a single point and $f_c$ is
positive on the support of $\mu$.) Thus $\phi'(c)$ = variance of $X_c$ > 0,
for all $c$. This proves (i).


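Although a direct verification of (ii) is possible we will give a slightly
different proof. We will show that as $c \to -\infty$, the random variable
$X_c$ converges in distribution to the constant function $a$ so that
$\phi(c)$, which is the expected value of $X_c$, converges to $a$. Note that
by definition of $a$, for all $\epsilon > 0$, $\mu([a, a + \epsilon)) > 0$
while $\mu((-\infty, a)) = 0$. Also $\mu((b, \infty)) = 0$, whence

$$P(X_c > a + \epsilon) = \frac{\int_{(a+\epsilon, b]} e^{cx}\,\mu(dx)}{\psi(c)} = \frac{\int_{(a+\epsilon, b]} e^{cx}\,\mu(dx)}{\int_{[a, b]} e^{cx}\,\mu(dx)}.$$

For $c < 0$ and $0 < \epsilon < b - a$, bounding the numerator from above and
the denominator from below gives

$$P(X_c > a + \epsilon) \le \frac{e^{c(a+\epsilon)}\,\mu((a + \epsilon, b])}{e^{c(a + \frac{\epsilon}{2})}\,\mu([a, a + \frac{\epsilon}{2}))} \le \frac{e^{c\epsilon/2}}{\mu([a, a + \frac{\epsilon}{2}))} \to 0 \quad \text{as } c \to -\infty.$$

Also, since $\mu((-\infty, a)) = 0$, $P(X_c < a) = 0$. So, $X_c \to a$ in
distribution as $c \to -\infty$, whence $\phi(c) \to a$ as $c \to -\infty$.
This proves (ii). The proof of (iii) is similar. Finally (iv) follows from
the intermediate value theorem since $\phi$ is strictly increasing and
continuous with range $(a, b)$. This completes the proof of Proposition 2.2.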
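
For a measure supported on finitely many points, the mean of the exponentially tilted density can be evaluated directly, and the unique tilt parameter with a prescribed mean can be found by bisection, because that mean is strictly increasing in the parameter. The following is a sketch under these assumptions: the support `xs`, the weights `ws`, the target 0.7 and the names `phi` and `solve_c` are illustrative, and bisection is simply one convenient root finder, not a method taken from the text.

```python
import math

def phi(c, xs, ws):
    """Mean of the tilted density exp(c*x)/psi(c) for the discrete measure (xs, ws)."""
    shift = max(c * x for x in xs)                     # avoids overflow in exp
    e = [w * math.exp(c * x - shift) for x, w in zip(xs, ws)]
    return sum(x * ei for x, ei in zip(xs, e)) / sum(e)

def solve_c(alpha, xs, ws, iterations=200):
    """Find c with phi(c) = alpha by bisection; assumes all weights are positive."""
    if not min(xs) < alpha < max(xs):
        raise ValueError("alpha must lie strictly between the extreme support points")
    lo, hi = -1.0, 1.0
    while phi(lo, xs, ws) > alpha:     # widen the bracket to the left
        lo *= 2.0
    while phi(hi, xs, ws) < alpha:     # widen the bracket to the right
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if phi(mid, xs, ws) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # support points in the compact set [0, 1]
ws = [0.2] * 5                     # uniform probability weights
c = solve_c(0.7, xs, ws)
print(c, phi(c, xs, ws))
```

Evaluating `phi` at very negative and very positive arguments also illustrates the limits at the two ends of the support.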

Proposition 2.2 also appears at the beginning of the theory of large
deviations (see [10]), thus giving a glimpse of the natural connection
between large deviation theory and entropy theory. (See Varadhan's
interview, p. 31, [11].)

The requirement that $\mu$ have compact support can be relaxed in the
above proposition. The following is a result under a relaxed condition.

Let $\mu$ be a measure on $\mathbb{R}$. Let
$I = \{c : \int_{\mathbb{R}} e^{cx}\,\mu(dx) < \infty\}$. It can be shown
that $I$ is a connected subset of $\mathbb{R}$ which can be empty, a
singleton, or an interval that is half open, open, closed, finite,
semi-finite or all of $\mathbb{R}$ (see [2]). Suppose $I$ has a non-empty
interior $I^0$. Then in $I^0$ the function
$\psi(c) = \int_{\mathbb{R}} e^{cx}\,\mu(dx)$ is infinitely differentiable
with $\psi^{(k)}(c) = \int_{\mathbb{R}} e^{cx} x^k\,\mu(dx)$. Further
$\phi(c) = \frac{\psi'(c)}{\psi(c)}$ satisfies

$$\phi'(c) = \frac{\psi''(c)\psi(c) - (\psi'(c))^2}{(\psi(c))^2},$$

which is positive, being equal to the variance of a random variable with
probability density function $\frac{e^{cx}}{\psi(c)}$ with respect to $\mu$.
Thus, for any $\alpha$ satisfying
$\inf_{c \in I^0} \phi(c) < \alpha < \sup_{c \in I^0} \phi(c)$, there is a
unique $c_0$ in $I^0$ such that $\phi(c_0) = \alpha$.

Let $\mu$ be a probability measure on $\mathbb{R}$. Note that a real number
$\lambda$ is the mean $\int_{\mathbb{R}} x\,\nu(dx)$ of a probability measure
$\nu$ absolutely continuous with respect to $\mu$ if and only if
$\mu(\{x : x < \lambda\})$, $\mu(\{x : x > \lambda\})$ are both positive.

As a corollary of Proposition 2.2 we have:


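Corollary 2.1. Let the closed support of $\mu$ be a compact set $C$. Let
$\alpha$ be such that $\mu\{x : x \le \alpha\}$, $\mu\{x : x \ge \alpha\}$ are
both positive. Let

$$\mathcal{F}_\alpha = \left\{ f : f \text{ a p.d.f. w.r.t. } \mu, \ \int_C x f(x)\,\mu(dx) = \alpha \right\}.$$

Then there exists a unique probability density function $g$ with respect to
$\mu$ such that

$$H(g, \mu) = \max\{H(f, \mu) : f \in \mathcal{F}_\alpha\}.$$

If $\alpha = \inf C$ or if $\alpha = \sup C$, then $\mu$ necessarily assigns
positive mass to $\alpha$ and $g = \frac{1}{\mu(\{\alpha\})} 1_{\{\alpha\}}$.
Let $\inf C < \alpha < \sup C$. Then there is a unique $c$ such that with
$g = f_c = \frac{e^{cx}}{\int_C e^{cy}\,\mu(dy)}$ one has
$\alpha = \int_C x f_c(x)\,\mu(dx)$ and

$$H(g, \mu) = H(f_c, \mu) = \max\{H(f, \mu) : f \in \mathcal{F}_\alpha\}.$$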


Entropy and Martingale 9 


Keeping in mind the notation of the above corollary, and the fact that
$\alpha$ uniquely determines $c$, we write $\nu_{\alpha, \mu}$ to denote the
probability measure $f_c\,d\mu$, i.e., the measure with probability density
function $f_c$ with respect to $\mu$. It is also the Gibbs-Boltzmann measure
$\nu_{(\mu, c, h)}$ with $h(x) = x$. We are now ready to state the result on
the weak convergence of Gibbs-Boltzmann measures.


Theorem 2.3. Let $C$ be a compact subset of $\mathbb{R}$. Let
$(\mu_n)_{n=1}^{\infty}$ be a sequence of probability measures such that
(i) the support of each $\mu_n$ is contained in $C$, (ii) $\mu_n \to \mu$
weakly. Let

$$a = \inf\{x : \mu((-\infty, x)) > 0\}, \qquad b = \sup\{x : \mu((x, \infty)) > 0\}.$$

Let $a < \alpha < b$. Then for all large $n$, $\nu_{\alpha, \mu_n}$ is well
defined and $\nu_{\alpha, \mu_n} \to \nu_{\alpha, \mu}$ weakly.


Proof: Since $\mu_n \to \mu$ weakly, and $a < \alpha < b$, for $n$ large,
$\mu_n((-\infty, \alpha)) > 0$, $\mu_n((\alpha, \infty)) > 0$. So, by
Proposition 2.2 it follows that there is a unique $c_n$ such that with
$f_{c_n}(x) = \frac{e^{c_n x}}{\int_C e^{c_n y}\,\mu_n(dy)}$,

$$\int_C x f_{c_n}(x)\,\mu_n(dx) = \alpha.$$

Thus $\nu_{\alpha, \mu_n}$ is well defined for all large $n$. Next we claim
that the $c_n$'s are bounded. If the $c_n$'s are not bounded, then there is a
subsequence of $(c_n)_{n=1}^{\infty}$ which diverges to $-\infty$ or to
$+\infty$. Suppose a subsequence of $(c_n)_{n=1}^{\infty}$ diverges to
$-\infty$. Note that for all $\epsilon > 0$, and $c_n < 0$,

$$\nu_{\alpha, \mu_n}([a + \epsilon, \infty)) = \frac{\int_{[a+\epsilon, \infty)} e^{c_n x}\,\mu_n(dx)}{\int_C e^{c_n x}\,\mu_n(dx)} \le \frac{e^{c_n(a+\epsilon)}\,\mu_n([a + \epsilon, \infty))}{e^{c_n(a + \frac{\epsilon}{2})}\,\mu_n([a, a + \frac{\epsilon}{2}))}. \tag{6}$$

Since $\mu_n \to \mu$ weakly, for each $\epsilon > 0$,
$\liminf_{n \to \infty} \mu_n([a, a + \epsilon)) > 0$. Therefore, over the
subsequence in question, $\nu_{\alpha, \mu_n}([a + \epsilon, \infty)) \to 0$
by (6), and since $\nu_{\alpha, \mu_n}((-\infty, a)) = 0$ we see that
$(\nu_{\alpha, \mu_n})_{n=1}^{\infty}$ converges weakly to the Dirac measure
at $a$. Since $C$ is compact, this implies that
$\int_C x\,\nu_{\alpha, \mu_n}(dx) \to a$ as $n \to \infty$, contradicting the
fact that $\int_C x\,\nu_{\alpha, \mu_n}(dx) = \alpha > a$, for all $n$.
Similarly, $(c_n)_{n=1}^{\infty}$ is bounded above. So
$(c_n)_{n=1}^{\infty}$ is a bounded sequence, which in fact converges as we
see below. Let a subsequence $(c_{n_k})_{k=1}^{\infty}$ converge to a real
number $c$. Then, since $\mu_{n_k} \to \mu$ weakly, and since all $\mu_n$
have support contained in $C$, a compact set, we see that

$$\int_C e^{c_{n_k} x}\,\mu_{n_k}(dx) \to \int_C e^{cx}\,\mu(dx),$$

and

$$\int_C x e^{c_{n_k} x}\,\mu_{n_k}(dx) \to \int_C x e^{cx}\,\mu(dx),$$

as $k \to \infty$, whence
$\alpha = \int_C x f_{c_{n_k}}(x)\,\mu_{n_k}(dx) \to \phi(c)$, so that
$\phi(c) = \alpha$. Again, by Proposition 2.2, $c$ is uniquely determined by
$\alpha$, so that all subsequential limits of $(c_n)_{n=1}^{\infty}$ are the
same. So $c_n \to c$ as $n \to \infty$. Clearly then, since $C$ is compact,
$f_{c_n} \to f_c$ uniformly on $C$, so that
$\nu_{\alpha, \mu_n} \to \nu_{\alpha, \mu}$ weakly, thus proving Theorem 2.3.

This theorem allows us to answer a question raised in [6]. Let $C$ be a
compact subset of $\mathbb{R}$. For each $\epsilon > 0$ let
$C_\epsilon = \{x_{\epsilon,1} < x_{\epsilon,2} < \cdots < x_{\epsilon,n_\epsilon}\}$
be an $\epsilon$-net in $C$, i.e., for all $x \in C$, there is an
$x_{\epsilon,j}$ such that $|x - x_{\epsilon,j}| < \epsilon$. Fix $\alpha$
such that $\inf C < \alpha < \sup C$. Then for small enough $\epsilon$ it must
hold that $x_{\epsilon,1} < \alpha < x_{\epsilon,n_\epsilon}$. Let
$\mu_\epsilon$ be the uniform distribution on $C_\epsilon$. Let
$\nu_{\alpha, \mu_\epsilon}$ be the Gibbs-Boltzmann distribution on
$C_\epsilon$ corresponding to $\mu_\epsilon$ and $\alpha$. The problem raised
in [6] was whether $\nu_{\alpha, \mu_\epsilon}$ converges to a unique limit as
$\epsilon \to 0$. Theorem 2.3 above answers this in the negative. Take
$C = [0, 1]$, and let $(x_n)_{n=1}^{\infty}$ be a sequence of points in $C$
which become dense in $C$ and such that if $\mu_n$ denotes the uniform
distribution on the first $n$ points of the sequence, then the sequence
$(\mu_n)_{n=1}^{\infty}$ has no unique weak limit. By Theorem 2.3, the
associated Gibbs-Boltzmann distributions
$(\nu_{\alpha, \mu_n})_{n=1}^{\infty}$ will also not have a unique weak limit.

(Here is a way of constructing such a sequence $(x_n)_{n=1}^{\infty}$. Let
$\mu_1$ and $\mu_2$ be two different probability measures on $[0, 1]$ both
equivalent to Lebesgue measure. Let $(X_n)_{n=1}^{\infty}$,
$(Y_n)_{n=1}^{\infty}$ be two sequences of points in $[0, 1]$ such that the
sequence of empirical distributions based on $(X_n)_{n=1}^{\infty}$ converges
weakly to $\mu_1$ and that based on $(Y_n)_{n=1}^{\infty}$ converges weakly to
$\mu_2$. Let $(Z_i)_{i=1}^{\infty}$ be defined as follows:

$$Z_i = X_i, \ 1 \le i \le n_1, \qquad Z_i = Y_i, \ n_1 < i \le n_2, \qquad Z_i = X_i, \ n_2 < i \le n_3, \ \cdots.$$

One can choose $n_1 < n_2 < n_3 < \cdots$ in such a way that the empirical
distribution of $(Z_i)_{i=1}^{\infty}$ converges to $\mu_1$ over the sequence
$(n_{2k+1})_{k=1}^{\infty}$ and to $\mu_2$ over the sequence
$(n_{2k})_{k=1}^{\infty}$. Since $\mu_1$, $\mu_2$ are equivalent to the
Lebesgue measure on $[0, 1]$, the sequence $(Z_n)_{n=1}^{\infty}$ is dense in
$[0, 1]$.)
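
The interleaving construction can be imitated numerically. The sketch below is only an illustration under several assumptions: the two target laws are taken to have densities $2x$ and $2(1-x)$ on $[0, 1]$, the points are generated from a golden-ratio rotation rather than sampled at random, and the block boundaries grow only by a factor of ten, so the empirical means oscillate between two distinct values instead of converging exactly to the two target means.

```python
import math

GOLDEN = (math.sqrt(5.0) - 1.0) / 2.0

def u(i):
    """i-th point of an equidistributed sequence in [0, 1) (irrational rotation)."""
    return (i * GOLDEN) % 1.0

def x_point(i):
    """Empirical law approaches the density 2x on [0, 1], whose mean is 2/3."""
    return math.sqrt(u(i))

def y_point(i):
    """Empirical law approaches the density 2(1 - y) on [0, 1], whose mean is 1/3."""
    return 1.0 - math.sqrt(u(i))

def interleaved(boundaries):
    """Z_i = X_i on blocks 1, 3, 5, ... and Z_i = Y_i on blocks 2, 4, ..."""
    z = []
    start = 1
    for k, n_k in enumerate(boundaries):
        pick = x_point if k % 2 == 0 else y_point
        z.extend(pick(i) for i in range(start, n_k + 1))
        start = n_k + 1
    return z

boundaries = [10, 100, 1000, 10000, 100000]   # illustrative n_1 < n_2 < ...
z = interleaved(boundaries)
means = [sum(z[:n]) / n for n in boundaries]
print([round(m, 3) for m in means])
```

The empirical mean along the odd-numbered block ends stays well above the one along the even-numbered block ends, so the empirical distributions cannot have a single weak limit.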

Remarks. See [1] for some further applications of the above discussion.
The quantity $H(f, \mu)$ or its negative has been known in statistical
literature as Kullback-Leibler information. In financial mathematics, the
quantity $-H(f, \mu)$ is called the relative entropy with respect to $\mu$,
so then one deals with minimizing the relative entropy.

1.2.3. Relative Entropy and Conditioning


Let $(\Omega, \mathcal{B}, \mu)$ be a probability space, where
$(\Omega, \mathcal{B})$ is a standard Borel space (see [7], [9]). For a given
countable collection $\{B_n\}_{n=1}^{\infty}$ in $\mathcal{B}$ let
$\mathcal{Q}$, $\mathcal{C}$ be the partition and the $\sigma$-algebra
generated by it. (By the partition generated by $\{B_n\}_{n=1}^{\infty}$ we
mean the collection

$$\mathcal{Q} = \left\{ q : q = \bigcap_{i=1}^{\infty} B_i^{\epsilon_i} \right\},$$

where $\epsilon_i = 0$ or $1$, and $B_i^0 = B_i$, $B_i^1 = \Omega - B_i$.)

For any subset $A$ of $\Omega$, by the saturation of $A$ with respect to
$\mathcal{Q}$ we mean the union of elements of $\mathcal{Q}$ which have
non-empty intersection with $A$.

It is known that
$\mathcal{C} = \{C \in \mathcal{B} : C \text{ a union of elements in } \mathcal{Q}\}$.
We regard $\mathcal{C}$ also as a $\sigma$-algebra on $\mathcal{Q}$. Note that
$\mathcal{C}$-measurable functions are the functions which are $\mathcal{B}$
measurable and which are constant on elements of $\mathcal{Q}$. A
$\mathcal{C}$-measurable function on $\Omega$ is therefore also a
$\mathcal{C}$-measurable function on $\mathcal{Q}$, and if $f$ is such a
function, we regard it both as a function on $\Omega$ and on $\mathcal{Q}$. We
write $f(q)$ to denote the constant value of such a function on $q$,
$q \in \mathcal{Q}$. In addition to $\mathcal{C}$, we also need a larger
$\sigma$-algebra, denoted by $\mathcal{A}$, generated by analytic subsets of
$\mathcal{Q}$ (see [9]).

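Since $(\Omega, \mathcal{B})$ is a standard Borel space, for any probability
measure $\mu$ on $(\Omega, \mathcal{B})$ there exists a regular conditional
probability given $\mathcal{C}$ (equivalently, a disintegration of $\mu$ with
respect to $\mathcal{Q}$). This means that there is a function
$\mu(\cdot, \cdot)$ on $\mathcal{B} \times \mathcal{Q}$, such that

1) $\mu(\cdot, q)$ is a probability measure on $\mathcal{B}$,

2) $\mu(q, q) = 1$,

3) for each $A \in \mathcal{B}$, $\mu(A, \cdot)$ is measurable with respect to
$\mathcal{A}$ and

$$\mu(A) = \int_\Omega \mu(A, q(\omega))\,\mu(d\omega) = \int_{\mathcal{Q}} \mu(A, q)\,\mu|_{\mathcal{C}}(dq),$$

where $q(\omega)$ is the element of $\mathcal{Q}$ containing $\omega$,

4) if $\mu'(\cdot, \cdot)$ is another such function then
$\mu(\cdot, q(\omega)) = \mu'(\cdot, q(\omega))$ for a.e. $\omega$ (with
respect to $\mu$).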

Note that sets in $\mathcal{A}$ are measurable with respect to every
probability measure on $\mathcal{B}$. Further, we can say that there is a
$\mu$-null set $N$ which is a union of elements of $\mathcal{Q}$ and such that
(i) $\mathcal{C}|_{\Omega - N}$ is a standard Borel structure on
$\Omega - N$, (ii) for each $A \in \mathcal{C}|_{\Omega - N}$,
$\mu(A, \cdot)$ is measurable with respect to this Borel structure.


The function $\mu(\cdot, \cdot)$ is called the conditional probability
distribution of $\mu$ with respect to $\mathcal{Q}$, or, the disintegration of
$\mu$ with respect to the partition $\mathcal{Q}$. If $f$ is a $\mathcal{B}$
measurable function with finite integral with respect to $\mu$, then the
function $h(\omega) = \int_q f(y)\,\mu(dy, q)$, $\omega \in q$, is called the
conditional expectation of $f$ with respect to $\mathcal{Q}$ (or with respect
to $\mathcal{C}$) and denoted by $E_\mu(f \mid \mathcal{Q})$ or
$E_\mu(f \mid \mathcal{C})$. If $\mathcal{Q}$ is the partition induced by a
measurable function $g$, $E_\mu(f \mid \mathcal{Q})$ is called the conditional
expectation of $f$ given $g$ and written $E_\mu(f \mid g)$. (See [9], p. 209.)


The measure $\mu$ is completely determined by $\mu(\cdot, \cdot)$ together
with the restriction of $\mu$ to $\mathcal{C}$, denoted by
$\mu|_{\mathcal{C}}$. We note also that if $\nu'$ is any probability measure
on $\mathcal{C}$ then $\nu = \int_{\mathcal{Q}} \mu(\cdot, q)\,\nu'(dq)$ is a
measure on $\mathcal{B}$ having the same conditional distribution (or
disintegration) with respect to $\mathcal{Q}$ as that of $\mu$.

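Let $\mu$ and $\nu$ be two probability measures on $\mathcal{B}$, with $\nu$
absolutely continuous with respect to $\mu$. Let $\mu(\cdot, q)$,
$\nu(\cdot, q)$, $q \in \mathcal{Q}$, be the disintegrations of $\mu$ and
$\nu$ with respect to the partition $\mathcal{Q}$. Then, for a.e. $\omega$
with respect to $\mu$,

$$\frac{d\nu}{d\mu}(\omega) = \frac{d\nu(\cdot, q)}{d\mu(\cdot, q)}(\omega) \cdot \frac{d\nu|_{\mathcal{C}}}{d\mu|_{\mathcal{C}}}(q), \quad \text{if } \omega \in q.$$

A calculation using this identity shows that

$$H\left(\frac{d\nu}{d\mu}, \mu\right) = \int_{\mathcal{Q}} H\left(\frac{d\nu(\cdot, q)}{d\mu(\cdot, q)}, \mu(\cdot, q)\right) \nu|_{\mathcal{C}}(dq) + H\left(\frac{d\nu|_{\mathcal{C}}}{d\mu|_{\mathcal{C}}}, \mu|_{\mathcal{C}}\right). \tag{7}$$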
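
On a finite space with a finite partition, the decomposition of relative entropy into a within-block part and a between-block part can be verified directly. The sketch below assumes the convention $H(f, \mu) = -\int f \log f \, d\mu$ used in this article; the five-point space, the vectors `mu` and `nu` and the partition `blocks` are made-up illustrative data.

```python
import math

def H(f, mu):
    """H(f, mu) = -sum_w f(w) log f(w) mu(w), with the convention 0 log 0 = 0."""
    return -sum(fw * math.log(fw) * mw for fw, mw in zip(f, mu) if fw > 0 and mw > 0)

mu = [0.10, 0.20, 0.30, 0.15, 0.25]    # reference probability measure
nu = [0.30, 0.10, 0.20, 0.30, 0.10]    # nu is absolutely continuous w.r.t. mu
blocks = [[0, 1], [2, 3, 4]]           # partition of the five-point space

# Left hand side: entropy of the density d(nu)/d(mu) relative to mu.
lhs = H([n / m for n, m in zip(nu, mu)], mu)

# Restrictions of mu and nu to the sigma-algebra generated by the partition.
mu_C = [sum(mu[i] for i in q) for q in blocks]
nu_C = [sum(nu[i] for i in q) for q in blocks]

# Between-block term.
rhs = H([n / m for n, m in zip(nu_C, mu_C)], mu_C)

# Within-block terms, averaged with respect to the restriction of nu.
for q, mq, nq in zip(blocks, mu_C, nu_C):
    mu_q = [mu[i] / mq for i in q]     # conditional distribution of mu on the block
    nu_q = [nu[i] / nq for i in q]     # conditional distribution of nu on the block
    rhs += nq * H([a / b for a, b in zip(nu_q, mu_q)], mu_q)

print(lhs, rhs)
```

Note that the within-block entropies are averaged with respect to the restriction of the second measure, not the reference measure.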


Assume now that $f$ is a real valued function having finite expectation
with respect to $\mu$, and let $g$ be a real valued function on $\mathcal{Q}$
for which there is a probability measure $\nu$, absolutely continuous with
respect to $\mu$, such that for all $q \in \mathcal{Q}$,

$$\int_q f(\omega)\,\nu(d\omega, q) = g(q). \tag{8}$$

(Note that $g$ is necessarily $\mathcal{C}$ measurable.)


Theorem 2.4. If $\nu_0$ is a probability measure which maximizes
$H\left(\frac{d\nu}{d\mu}, \mu\right)$ among all probability measures $\nu$,
absolutely continuous with respect to $\mu$ and satisfying (8), then for a.e.
$q$ (with respect to $\mu|_{\mathcal{C}}$), $\nu_0(\cdot, q)$ maximizes the
relative entropy
$H\left(\frac{d\lambda}{d\mu(\cdot, q)}, \mu(\cdot, q)\right)$ as $\lambda$
ranges over all probability measures on $q$ absolutely continuous with respect
to $\mu(\cdot, q)$ and satisfying

$$\int_q f(\omega)\,\lambda(d\omega) = g(q), \quad q \in \mathcal{Q}.$$


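Proof: Assume, in order to arrive at a contradiction, that the theorem is
false. Then there is a set $E \subset \mathcal{Q}$ of positive
$\mu|_{\mathcal{C}}$ measure and a transition probability
$\lambda(\cdot, \cdot)$ on $\mathcal{B} \times E$ such that for each
$q \in E$,

(i) $\lambda(q, q) = 1$,
(ii) $\lambda(\cdot, q)$ is absolutely continuous with respect to
$\mu(\cdot, q)$,
(iii) $H\left(\frac{d\nu_0(\cdot, q)}{d\mu(\cdot, q)}, \mu(\cdot, q)\right) < H\left(\frac{d\lambda(\cdot, q)}{d\mu(\cdot, q)}, \mu(\cdot, q)\right)$, and
(iv) $\int_q f(\omega)\,\lambda(d\omega, q) = g(q)$.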


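The existence of such an $E$ and the transition probability
$\lambda(\cdot, \cdot)$ is easy to see if the partition $\mathcal{Q}$ is
finite or countable. In the general case the proof relies on some non-trivial
measure theory. Define a new transition probability on
$\mathcal{B} \times \mathcal{Q}$ as follows: For all $A \in \mathcal{B}$,

$$T(A, q) = \nu_0(A, q) \ \text{ if } q \in \mathcal{Q} - E, \qquad T(A, q) = \lambda(A, q) \ \text{ if } q \in E.$$

The measure $\tau$ defined on $\mathcal{B}$ by

$$\tau(A) = \int_{\mathcal{Q}} T(A, q)\,\nu_0|_{\mathcal{C}}(dq)$$

is absolutely continuous with respect to $\mu$, $T(\cdot, \cdot)$ is its
disintegration with respect to $\mathcal{Q}$,
$\tau|_{\mathcal{C}} = \nu_0|_{\mathcal{C}}$ and for each $q \in \mathcal{Q}$

$$\int_q f(\omega)\,T(d\omega, q) = g(q).$$

Finally, by formula (7), and in view of the values of $T$ on $E$,

$$H\left(\frac{d\tau}{d\mu}, \mu\right) = \int_{\mathcal{Q}} H\left(\frac{dT(\cdot, q)}{d\mu(\cdot, q)}, \mu(\cdot, q)\right) \nu_0|_{\mathcal{C}}(dq) + H\left(\frac{d\nu_0|_{\mathcal{C}}}{d\mu|_{\mathcal{C}}}, \mu|_{\mathcal{C}}\right)$$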

14 K. B. Athreya and M. G. Nadkarni 


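is strictly bigger than $H\left(\frac{d\nu_0}{d\mu}, \mu\right)$, which by
formula (7) is equal to

$$\int_{\mathcal{Q}} H\left(\frac{d\nu_0(\cdot, q)}{d\mu(\cdot, q)}, \mu(\cdot, q)\right) \nu_0|_{\mathcal{C}}(dq) + H\left(\frac{d\nu_0|_{\mathcal{C}}}{d\mu|_{\mathcal{C}}}, \mu|_{\mathcal{C}}\right).$$

This contradicts the maximality of $H\left(\frac{d\nu_0}{d\mu}, \mu\right)$,
and proves the theorem.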


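Note that the measures $\nu_0(\cdot, q)$, $q \in \mathcal{Q}$, remain
unchanged even if $\nu_0$ maximizes $H\left(\frac{d\nu}{d\mu}, \mu\right)$
under the additional constraint that
$\int_\Omega f(\omega)\,\nu(d\omega) = \alpha$ for some fixed $\alpha$.
However, $\nu_0|_{\mathcal{C}}$ need not maximize
$H\left(\frac{dm}{d\mu|_{\mathcal{C}}}, \mu|_{\mathcal{C}}\right)$ among all
probability measures $m$ on $\mathcal{C}$ satisfying
$\int_{\mathcal{Q}} g(q)\,m(dq) = \alpha$.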


1.3. Measure Free Martingales, Weak Martingales, Martingales


1.3.1. Finite Range Case 


In this section we will discuss the notion of measure free martingales and
its relation to the usual martingale. In [6] the simpler case of measure
free martingale, where functions assume only finitely many values, was
introduced and we recall it below.

Let $\Omega$ be a non-empty set. Let $(f_n)_{n=1}^{\infty}$ be a sequence of
real valued functions such that each $f_n$ has finite range, say
$(x_{n1}, x_{n2}, \cdots, x_{nk_n})$, and these values are assumed on subsets
$\Omega_{n1}, \Omega_{n2}, \cdots, \Omega_{nk_n}$. These sets form a partition
of $\Omega$ which we denote by $\mathcal{P}_n$. We denote by $\mathcal{Q}_n$
the partition generated by $\mathcal{P}_1, \mathcal{P}_2, \cdots, \mathcal{P}_n$
and the algebra generated by $\mathcal{Q}_n$ is denoted by $\mathcal{A}_n$.
Let $\mathcal{A}_\infty$ denote the algebra
$\bigcup_{n=1}^{\infty} \mathcal{A}_n$.

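Define $\mathcal{A}_n$ measurable functions $m_n$, $M_n$ as follows: for
$Q \in \mathcal{Q}_n$ and $\omega \in Q$,

$$m_n(\omega) = \min_{x \in Q} f_{n+1}(x), \qquad M_n(\omega) = \max_{x \in Q} f_{n+1}(x).$$

Definition 3.1. The sequence $(f_n, \mathcal{A}_n)_{n=1}^{\infty}$ is said to
be a measure free martingale or probability free martingale if

$$m_n(\omega) \le f_n(\omega) \le M_n(\omega), \quad \forall\, \omega \in \Omega, \ n \ge 1.$$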


Clearly, for each $Q \in \mathcal{Q}_n$, the function $f_n$ is constant on
$Q$. We denote this constant by $f_n(Q)$. With this notation, it is easy to
see that $(f_n, \mathcal{A}_n)_{n=1}^{\infty}$ is a measure free martingale or
probability free martingale if and only if for each $n$ and for each
$Q \in \mathcal{Q}_n$, $f_n(Q)$ lies between the minimum and the maximum
values of $f_{n+1}(Q')$ as $Q'$ runs over $Q \cap \mathcal{Q}_{n+1}$. It is
easy to see that if there is a probability measure on $\mathcal{A}_\infty$
with respect to which $(f_n, \mathcal{A}_n)_{n=1}^{\infty}$ is a martingale
then $(f_n, \mathcal{A}_n)_{n=1}^{\infty}$ is also a measure free martingale.
Indeed let $P$ be such a measure. Then, for any $Q$ in $\mathcal{Q}_n$,
$f_n(Q)$ is equal to

$$\frac{1}{P(Q)} \sum_{\{Q' \in \mathcal{Q}_{n+1},\ Q' \subset Q\}} f_{n+1}(Q') P(Q'),$$

so that $f_n(Q)$ lies between the minimum and the maximum values
$f_{n+1}(Q')$, $Q' \in Q \cap \mathcal{Q}_{n+1}$. In [6], the following
converse is proved.


Theorem 3.1. Given a measure free martingale $(f_n, \mathcal{A}_n)_{n=1}^{\infty}$, there exists for each $n > 0$ a measure $P_n$ on $\mathcal{A}_n$ such that

$$P_{n+1}|_{\mathcal{A}_n} = P_n, \qquad E_{n+1}(f_{n+1} \mid \mathcal{A}_n) = f_n,$$

where $E_{n+1}$ denotes the conditional expectation with respect to the probability measure $P_{n+1}$. There is a finitely additive probability measure $P$ on the algebra $\mathcal{A}_\infty$ such that, for each $n$, $P|_{\mathcal{A}_n} = P_n$. Moreover such a $P$ is unique if certain naturally occurring entropies are maximized.
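
On a finite set the condition of Definition 3.1 can be checked mechanically. The following Python sketch is illustrative only: it assumes $\Omega = \{0, \dots, N-1\}$, a finite list of functions given as lists of values, and the hypothetical helper names `partition_cells` and `is_measure_free_martingale`, none of which come from [6].

```python
from collections import defaultdict

def partition_cells(fs, n):
    """Cells of Q_n on Omega = {0, ..., N-1}: the sets on which
    f_1, ..., f_n are all constant. fs[k][w] is the value f_{k+1}(w)."""
    cells = defaultdict(list)
    for w in range(len(fs[0])):
        cells[tuple(f[w] for f in fs[:n])].append(w)
    return list(cells.values())

def is_measure_free_martingale(fs):
    """True when, on every cell Q of Q_n, min f_{n+1} <= f_n(Q) <= max f_{n+1}."""
    for n in range(1, len(fs)):
        f_n, f_next = fs[n - 1], fs[n]
        for cell in partition_cells(fs, n):
            value = f_n[cell[0]]                 # f_n is constant on the cell
            nxt = [f_next[w] for w in cell]
            if not (min(nxt) <= value <= max(nxt)):
                return False
    return True

# Four points: f1 is constant, f2 splits Omega in two, f3 separates all points.
f1 = [0, 0, 0, 0]
f2 = [-1, -1, 1, 1]
f3 = [-3, 0, 1, 2]
print(is_measure_free_martingale([f1, f2, f3]))
```

Replacing `f3` by `[-3, 0, 2, 3]` breaks the condition on the cell where `f2 = 1`, since 1 no longer lies between 2 and 3.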


1.3.2. The General Case 


In the rest of this section we will dispense with the requirement that the functions $f_n$ assume only finitely many values.

Let $(\Omega, \mathcal{B})$ be a standard Borel space, and let $(f_n)_{n=1}^{\infty}$ be a sequence of real valued Borel functions on $\Omega$. For each $n$, let $\mathcal{P}_n = \{f_n^{-1}(\{x\}) : x \in \mathbb{R}\}$ denote the partition of $\Omega$ generated by $f_n$, and let $\mathcal{Q}_n$ denote the partition generated by $f_1, f_2, \dots, f_n$, i.e., $\mathcal{Q}_n$ is the superposition of $\mathcal{P}_1, \mathcal{P}_2, \dots, \mathcal{P}_n$. Let $\mathcal{B}_n$ be the $\sigma$-algebra generated by $f_1, f_2, \dots, f_n$. Since $(\Omega, \mathcal{B})$ is a standard Borel space, for each $n$, $\mathcal{B}_n$ is the collection of sets in $\mathcal{B}$ which can be written as a union of elements in $\mathcal{Q}_n$. For $q \in \mathcal{Q}_n$, $f_n$ is constant on $q$ and we denote this value by $f_n(q)$. The algebra $\bigcup_{n=1}^{\infty} \mathcal{B}_n$ will be denoted by $\mathcal{B}_\infty$ and we will assume that it generates $\mathcal{B}$. Note that we have changed the notation slightly. In section 1.3.1 above we denoted an element in $\mathcal{Q}_n$ by $Q$, while from now on we will use the lower case $q$.


Definition 3.2. Let $(f_n)_{n=1}^{\infty}$ be a sequence of $\mathcal{B}$ measurable real valued functions on $\Omega$.

(i) Say that the sequence $(f_n)_{n=1}^{\infty}$ is a measure free martingale if for each $n$ and for each $q \in \mathcal{Q}_n$, $f_n(q)$ is in the convex hull of the values assumed by $f_{n+1}$ on $q$ (note that $f_{n+1}$ need not be constant on $q \in \mathcal{Q}_n$).

(ii) Let $\nu_\infty$ be a finitely additive probability measure on $\mathcal{B}_\infty$ such that for each $n < \infty$, its restriction $\nu_n$ to $\mathcal{B}_n$ is countably additive. The sequence $(f_n)_{n=1}^{\infty}$ is called a weak martingale with respect to $\nu_\infty$ if for each $n$, $E_{\nu_{n+1}}(f_{n+1} \mid f_n) = f_n$ a.e. $\nu_{n+1}$.

(iii) A weak martingale $(f_n)_{n=1}^{\infty}$ with respect to $\nu_\infty$ is a martingale if $\nu_\infty$ is countably additive, in which case the countably additive extension of $\nu_\infty$ to $\mathcal{B}$ is denoted by $\nu$, and we call $(f_n)_{n=1}^{\infty}$ a martingale with respect to $\nu$.


Clearly every martingale is a weak martingale, and if $(f_n)_{n=1}^{\infty}$ is a weak martingale with respect to $\nu_\infty$, then for each $n$ we can modify $f_n$ on a $\nu_n$ null set so that the resulting new sequence of functions is a measure free martingale. Indeed if $\nu_n(\cdot,\cdot)$ is the conditional probability distribution of $\nu_{n+1}$ given $\mathcal{Q}_n$, then

$$f_n(q) = \int_q f_{n+1}(\omega)\,\nu_n(d\omega, q), \quad \nu_n \text{ a.e. } q \in \mathcal{Q}_n. \qquad (9)$$

At those $q$'s where the equality in (9) holds, $f_n(q)$ lies between the infimum and the supremum of the values assumed by $f_{n+1}$ on $q$. On the other $q$'s we modify $f_{n+1}$ by simply setting $f_{n+1}(\omega) = f_n(q)$, $\omega \in q$. The modified sequence $(f_n)_{n=1}^{\infty}$ is the required measure free martingale. In the converse direction we show that every measure free martingale admits a finitely additive measure on $\mathcal{B}_\infty$ under which it is a weak martingale.


Proposition 3.1. Let $(\Omega, \mathcal{B})$ be a standard Borel space and let $\mathcal{Q}$, $\mathcal{C}$ be the partition and the $\sigma$-algebra generated by a countable collection of sets in $\mathcal{B}$. Let $\mathcal{A}$ be the $\sigma$-algebra generated by analytic subsets of $\Omega$ which are unions of elements of $\mathcal{Q}$. Let $f$ and $g$ be respectively $\mathcal{B}$ and $\mathcal{C}$ measurable real valued functions on $\Omega$ such that for each $q \in \mathcal{Q}$, $g(q)$ is in the convex hull of the values assumed by $f$ on $q$. Then there exists a transition probability $\nu(\cdot,\cdot)$ on $\mathcal{B} \times \mathcal{Q}$ such that for each $A \in \mathcal{B}$, the function $\nu(A, \cdot)$ is $\mathcal{A}$ measurable, while $\nu(\cdot, q)$ is a probability measure on $\mathcal{B}$ supported on at most two points of $q$ satisfying

$$g(q) = \int_q f(\omega)\,\nu(d\omega, q), \quad \forall q \in \mathcal{Q}.$$

Proof: The sets


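$$S_1 = \{\omega \in \Omega : f(\omega) \le g(\omega)\}, \qquad S_2 = \{\omega \in \Omega : g(\omega) \le f(\omega)\}$$

are in $\mathcal{B}$. For each $q \in \mathcal{Q}$, since $g(q)$ is in the convex hull of the values assumed by $f$ on $q$, both $S_1$ and $S_2$ have non-empty intersection with $q$. By the von Neumann selection theorem (see [9], p. 199) there exist coanalytic sets $C_1 \subset S_1$, $C_2 \subset S_2$ which intersect each $q \in \mathcal{Q}$ in exactly one point. For each $q \in \mathcal{Q}$, let $\omega_1(q)$ and $\omega_2(q)$ be the points given by

$$\{\omega_1(q)\} = C_1 \cap q, \qquad \{\omega_2(q)\} = C_2 \cap q.$$

Then

$$f(\omega_1(q)) \le g(q) \le f(\omega_2(q)),$$

so that the middle real number $g(q)$ is a unique convex combination of $f(\omega_1(q))$, $f(\omega_2(q))$. If $f(\omega_1(q)) = f(\omega_2(q)) = g(q)$ write $p_1(q) = 1$, $p_2(q) = 0$, otherwise write

$$p_1(q) = \frac{f(\omega_2(q)) - g(q)}{f(\omega_2(q)) - f(\omega_1(q))}, \qquad p_2(q) = \frac{g(q) - f(\omega_1(q))}{f(\omega_2(q)) - f(\omega_1(q))}.$$

Then

$$p_1(q) f(\omega_1(q)) + p_2(q) f(\omega_2(q)) = g(q).$$

For each $q \in \mathcal{Q}$, let $\nu(\cdot, q)$ be the probability measure on $q$ with masses $p_1(q)$, $p_2(q)$ at $\omega_1(q)$, $\omega_2(q)$ respectively. The sets $C_1$, $C_2$ are co-analytic and the functions $f|_{C_1}$, $f|_{C_2}$ are $\mathcal{B}|_{C_1}$, $\mathcal{B}|_{C_2}$ measurable respectively, whence $p_1(\cdot)$, $p_2(\cdot)$ are $\mathcal{A}$ measurable. For any $A \in \mathcal{B}$ and $q \in \mathcal{Q}$,

$$\nu(A, q) = p_1(q) 1_A(\omega_1(q)) + p_2(q) 1_A(\omega_2(q)),$$

whence, for each $A \in \mathcal{B}$, $\nu(A, \cdot)$ is $\mathcal{A}$ measurable. The proposition is proved.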


Theorem 3.2. A measure free martingale admits a finitely additive measure under which it is a weak martingale.


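Proof: Let $(f_n)_{n=1}^{\infty}$ be a measure free martingale. Let $\mathcal{Q}_1, \mathcal{Q}_2$ and $\mathcal{B}_1, \mathcal{B}_2$ be the partitions and the $\sigma$-algebras generated by $f_1$ and by $f_1, f_2$ respectively. Let $\nu_1$ be a probability measure on $\mathcal{B}_1$. Since, for each $q \in \mathcal{Q}_1$, $f_1(q)$ is in the convex hull of the values assumed by $f_2$ on $q$, by Proposition 3.1 there is a transition probability $\nu_1(\cdot,\cdot)$ on $\mathcal{B} \times \mathcal{Q}_1$ such that for each $q \in \mathcal{Q}_1$, $\nu_1(q, q) = 1$ and

$$\int_q f_2(\omega)\,\nu_1(d\omega, q) = f_1(q).$$

For any $A \in \mathcal{B}_2$, define

$$\nu_2(A) = \int \nu_1(A, q)\,\nu_1(dq).$$

Then $\nu_2$ is a countably additive measure on $\mathcal{B}_2$,

$$\nu_2|_{\mathcal{B}_1} = \nu_1, \qquad E_{\nu_2}(f_2 \mid f_1) = f_1.$$

Having defined $\nu_2$, it is now clear how to construct $\nu_3, \nu_4, \dots, \nu_n, \dots$ such that for each $n$,

$$\nu_{n+1}|_{\mathcal{B}_n} = \nu_n$$

and

$$E_{\nu_{n+1}}(f_{n+1} \mid f_n) = f_n.$$

The finitely additive measure $\nu_\infty$ defined on $\mathcal{B}_\infty$ by

$$\nu_\infty(A) = \nu_n(A), \quad A \in \mathcal{B}_n,$$

satisfies, for each $n$,

$$\nu_\infty|_{\mathcal{B}_n} = \nu_n,$$

and $(f_n)_{n=1}^{\infty}$ is thus a weak martingale with respect to $\nu_\infty$. This proves the theorem.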
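
For a finite $\Omega$ the inductive step of this proof can be carried out explicitly. The sketch below is an illustration under stated assumptions, not the construction of Proposition 3.1 in full generality: $\Omega = \{0, \dots, N-1\}$, a cell of $\mathcal{Q}_n$ is coded by the tuple of values of $f_1, \dots, f_n$ on it, and the two selected points of each cell are taken where $f_{n+1}$ is smallest and largest. The function name `extend` is hypothetical.

```python
from collections import defaultdict

def extend(nu_n, fs, n):
    """One step nu_n -> nu_{n+1}. fs[k][w] is f_{k+1}(w); nu_n maps a cell of
    Q_n, coded by the tuple (f_1, ..., f_n) of its values, to its mass."""
    children = defaultdict(set)          # values taken by f_{n+1} on each cell
    for w in range(len(fs[0])):
        children[tuple(f[w] for f in fs[:n])].add(fs[n][w])
    nu_next = defaultdict(float)
    for key, mass in nu_n.items():
        lo, hi, g = min(children[key]), max(children[key]), key[-1]
        if not lo <= g <= hi:
            raise ValueError("not a measure free martingale")
        # Two-point convex weights: (1 - p_hi) * lo + p_hi * hi == g.
        p_hi = 0.0 if hi == lo else (g - lo) / (hi - lo)
        nu_next[key + (lo,)] += mass * (1.0 - p_hi)
        nu_next[key + (hi,)] += mass * p_hi
    return dict(nu_next)

fs = [[0, 0, 0, 0], [-1, -1, 1, 1], [-3, 0, 1, 2]]
nu1 = {(0,): 1.0}                        # f_1 is constant, so Q_1 has one cell
nu2 = extend(nu1, fs, 1)
nu3 = extend(nu2, fs, 2)
print(nu2)
print(nu3)
```

Each step preserves the mass of every cell of $\mathcal{Q}_n$ and makes the conditional mean of $f_{n+1}$ on that cell equal to $f_n$, which is the weak martingale property on this finite example.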


It is natural to ask when $\nu_\infty$ is countably additive. There is an answer to this. The refining system of partitions $(\mathcal{Q}_n)_{n=1}^{\infty}$, as well as the associated $\sigma$-algebras $(\mathcal{B}_n)_{n=1}^{\infty}$, is called a filtration. It is said to be complete if for every decreasing sequence $(q_n)_{n=1}^{\infty}$ of non-empty elements, $q_n \in \mathcal{Q}_n$, their intersection $\bigcap_{n=1}^{\infty} q_n$ is non-empty. We have:


Theorem 3.3. If the filtration $(\mathcal{Q}_n)_{n=1}^{\infty}$ associated with the measure free martingale is complete, then $\nu_\infty$ is countably additive.

This is indeed a consequence of the Kolmogorov consistency theorem formulated in terms of filtrations, which is as follows. Let the filtration $(\mathcal{Q}_n)_{n=1}^{\infty}$ arise from a sequence of Borel functions $(f_n)_{n=1}^{\infty}$, not necessarily a measure free martingale. For $1 \le i \le j$, we define the natural projection map $\Pi_{ij} : \mathcal{Q}_j \to \mathcal{Q}_i$ by

$$\Pi_{ij}(q) = r, \quad \text{if } q \subset r,\ q \in \mathcal{Q}_j,\ r \in \mathcal{Q}_i.$$

For each $i$ let $\Pi_i : \Omega \to \mathcal{Q}_i$ be defined by

$$\Pi_i(\omega) = q \in \mathcal{Q}_i, \quad \text{if } \omega \in q.$$

If $q \in \mathcal{Q}_n$ and $i \le n$, then $f_i$ is constant on $q$, and we write $f_i(q)$ to denote this constant value. For each $n$ let $\mathcal{T}_n$ be the smallest topology on $\mathcal{Q}_n$ which makes the map $q \to (f_1(q), f_2(q), \dots, f_n(q))$, $q \in \mathcal{Q}_n$, continuous. We note that the maps $\Pi_{ij}$, $i \le j$, are continuous. The topology $\mathcal{T}_n$ generates the $\sigma$-algebra $\mathcal{B}_n$. Any probability measure $\mu$ on $\mathcal{B}_n$ is compact approximable with respect to this topology, i.e., given $B \in \mathcal{B}_n$ and $\epsilon > 0$, there is a set $C \subset B$, $C$ compact w.r.t. $\mathcal{T}_n$, such that $\mu(B - C) < \epsilon$. For each $n$, let $P_n$ be a countably additive probability measure on $\mathcal{B}_n$. Assume that $P_{n+1}|_{\mathcal{B}_n} = P_n$. Define $P_\infty$ on $\bigcup_{n=1}^{\infty} \mathcal{B}_n$ by $P_\infty(A) = P_n(A)$, if $A \in \mathcal{B}_n$. $P_\infty$ is obviously finitely additive.

We have the Kolmogorov consistency theorem in our setting, arrived 
at after a discussion with B. V. Rao, and as pointed out by Rajeeva 
Karandikar, it is proved also in [7]. 


Theorem 3.4. If the filtration $(\mathcal{Q}_n)_{n=1}^{\infty}$ is complete, then $P_\infty$ is countably additive.


Proof: For simplicity, write $P$ for $P_\infty$. If $P$ is not countably additive, then there exists a decreasing sequence $(A_n)_{n=1}^{\infty}$ in $\bigcup_{n=1}^{\infty} \mathcal{B}_n$, with $\bigcap_{n=1}^{\infty} A_n = \emptyset$, such that for all $n$, $P(A_n) > a > 0$ for some positive real $a$. Without loss of generality we can assume that for each $n$, $A_n$ is in $\mathcal{B}_n$. We can choose a set $C_1 \subset A_1$, $C_1 \in \mathcal{B}_1$, $C_1$ compact with respect to the topology $\mathcal{T}_1$, such that

$$P(A_1 - C_1) = P_1(A_1 - C_1) < \frac{a}{4}.$$

Note that

$$A_2 \cap C_1 \in \mathcal{B}_2, \qquad P(A_2 \cap (A_1 - C_1)) \le P(A_1 - C_1) < \frac{a}{4},$$


20 K. B. Athreya and M. G. Nadkarni 


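whence

$$P(A_2 \cap C_1) > \frac{3}{4}a.$$

We next choose $C_2 \subset A_2 \cap C_1$, $C_2 \in \mathcal{B}_2$, $C_2$ compact in the topology $\mathcal{T}_2$, such that $P(A_2 \cap C_1 - C_2) < \frac{a}{4^2}$. Then $P(A_2 - C_2) < \frac{a}{4} + \frac{a}{4^2}$. Note that

$$A_3 \cap C_2 \in \mathcal{B}_3, \qquad P(A_3 \cap C_2) > a - \frac{a}{4} - \frac{a}{4^2}.$$

Proceeding thus we get a decreasing sequence $(C_n)_{n=1}^{\infty}$ such that for all $n$, $C_n \in \mathcal{B}_n$, $C_n \subset A_n$, $C_n$ compact in the topology $\mathcal{T}_n$, and

$$P(A_n - C_n) < \frac{a}{4} + \frac{a}{4^2} + \cdots + \frac{a}{4^n}.$$

Clearly each $C_n$ is non-empty. For each $n$ choose an element $q_n$ in $C_n$. Since $C_i$ is compact the sequence $(\Pi_{ij} q_j)_{j=i}^{\infty}$ has a subsequence converging to a point in $C_i$. By Cantor's diagonal procedure it is possible to choose the sequence $(q_n)_{n=1}^{\infty}$ in such a way that for each $i$, $(\Pi_{ij} q_j)_{j=i}^{\infty}$ is convergent in the topology $\mathcal{T}_i$ to an element $p_i$ in $\mathcal{Q}_i$. By continuity of the map $\Pi_{ij}$ we have $\Pi_{ij} p_j = p_i$ if $i \le j$, i.e., if $i \le j$ then $p_j \subset p_i$. By completeness of the filtration we conclude that $\bigcap_{i=1}^{\infty} p_i \ne \emptyset$. But

$$\bigcap_{i=1}^{\infty} p_i \subset \bigcap_{i=1}^{\infty} C_i \subset \bigcap_{i=1}^{\infty} A_i = \emptyset.$$

The contradiction proves the theorem.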


Remark. (i) The requirement that the filtration be complete has been crucial in the above discussion. Here is an example due to S. M. Srivastava of a filtration on the real line which is not complete, but the quotient topologies are locally compact second countable. Let $\Omega = \mathbb{R}$, and let, for $n \ge 1$, $\mathcal{Q}_n = \{\{r\}, r < n, [n, \infty)\}$, i.e., the $n$th partition $\mathcal{Q}_n$ consists of all singletons less than $n$ together with the interval $[n, \infty)$. Clearly $(\mathcal{Q}_n)_{n=1}^{\infty}$ is a filtration. For each $n$, the quotient topology on $\mathcal{Q}_n$ is isomorphic to the usual topology on $\mathbb{R}$. The set $C_n = [n, \infty)$ is compact in the quotient topology, $C_{n+1} \subset C_n$, but $\bigcap_{n=1}^{\infty} C_n$ is empty, so the filtration is not complete.

(ii) S. Bochner ([3]) has formulated and proved the Kolmogorov consistency theorem for projective families. One can derive the above version by proper identification of our sets and maps as a projective system, once the topologies $\mathcal{T}_n$ are described.

(iii) The totality of finitely additive measures on $\mathcal{B}_\infty$ which render the measure free martingale $f_n$, $n = 1, 2, 3, \dots$, into a weak martingale is a convex set whose extreme points are precisely those $\nu_\infty$ for which $\nu_1$ is a point mass, and for which, for each $n$, the disintegration $\nu_n(\cdot,\cdot)$ of $\nu_{n+1}$ with respect to $\mathcal{Q}_n$ has the property that for each $q \in \mathcal{Q}_n$, $\nu_n(\cdot, q)$ is supported on at most two points.

(iv) We note that measure free martingales have some nice properties not shared by the usual martingales. If $f_n$, $n = 1, 2, 3, \dots$, is a measure free martingale, then $[f_n]$, $n = 1, 2, 3, \dots$, where $[x]$ means the integral part of $x$, and $\min\{f_n, K\}$, $n = 1, 2, 3, \dots$, where $K$ is a fixed real number, are also measure free martingales. In other words, measure free martingales are closed under discretization and truncation.
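
The closure property in (iv) is easy to test on a finite example. The sketch below is only an illustration: it assumes a finite $\Omega = \{0, \dots, N-1\}$, reads the integral part $[x]$ as `math.floor`, recomputes the partitions $\mathcal{Q}_n$ from the transformed functions, and uses the hypothetical helper name `is_measure_free_martingale`.

```python
import math
from collections import defaultdict

def is_measure_free_martingale(fs):
    """fs[k][w] is the value of f_{k+1} at the point w of Omega = {0, ..., N-1}."""
    for n in range(1, len(fs)):
        cells = defaultdict(list)        # cells of Q_n, keyed by (f_1, ..., f_n)
        for w in range(len(fs[0])):
            cells[tuple(f[w] for f in fs[:n])].append(w)
        for key, cell in cells.items():
            nxt = [fs[n][w] for w in cell]
            if not (min(nxt) <= key[-1] <= max(nxt)):
                return False
    return True

fs = [[0.0] * 4, [-1.5, -1.5, 1.5, 1.5], [-3.2, 0.4, 1.5, 2.7]]
floored = [[math.floor(x) for x in f] for f in fs]       # discretization [f_n]
truncated = [[min(x, 1.0) for x in f] for f in fs]       # truncation min{f_n, K}, K = 1
print(is_measure_free_martingale(fs),
      is_measure_free_martingale(floored),
      is_measure_free_martingale(truncated))
```

Both transformations are monotone, which is why the ordering between $f_n(q)$ and the extreme values of $f_{n+1}$ on $q$ survives them.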


1.4. Equivalent Martingale Measures 


Let $(\Omega, \mathcal{B})$ be a standard Borel space and let $\mu$ be a probability measure on $\mathcal{B}$. Let $(f_n)_{n=1}^{\infty}$ be a sequence of Borel measurable real valued functions on $\Omega$, not necessarily a measure free martingale. In this section we discuss conditions, necessary as well as sufficient, for there to exist a measure $\nu$, having the same null sets as $\mu$, which renders the sequence $(f_n)_{n=1}^{\infty}$ a martingale. Clearly, if such a $\nu$ exists then we can modify $(f_n)_{n=1}^{\infty}$ on a $\nu$-null set, which is therefore also $\mu$-null, so that the new sequence of functions is a measure free martingale. Thus a necessary condition for the existence of a $\nu$, equivalent to $\mu$, under which $(f_n)_{n=1}^{\infty}$ is a martingale is that $(f_n)_{n=1}^{\infty}$ admit a modification on a $\mu$-null set so that the resulting sequence is a measure free martingale.

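Again assume that such a $\nu$ exists. For each $n$, let $\mu_n$, $\nu_n$ respectively be the restrictions of $\mu$, $\nu$ to $\mathcal{B}_n$. Let $\mu_n(\cdot,\cdot)$, $\nu_n(\cdot,\cdot)$ denote the disintegrations of $\mu_{n+1}$, $\nu_{n+1}$ with respect to the partition $\mathcal{Q}_n$. Since $\mu_{n+1}$ and $\nu_{n+1}$ have the same null sets, for $\mu_n$ almost every $q \in \mathcal{Q}_n$, $\mu_n(\cdot, q)$ and $\nu_n(\cdot, q)$ have the same null sets, and since

$$\int_q f_{n+1}(\omega)\,\nu_n(d\omega, q) = f_n(q),$$

it follows that for $\mu_n$ a.e. $q$

$$\mu_n(\{\omega \in q : f_{n+1}(\omega) \le f_n(q)\}, q) > 0, \qquad \mu_n(\{\omega \in q : f_{n+1}(\omega) \ge f_n(q)\}, q) > 0. \qquad (11)$$

Thus, if a $\nu$ equivalent to $\mu$ under which $(f_n)_{n=1}^{\infty}$ is a martingale exists, then for each $n$, for $\mu_n$ a.e. $q \in \mathcal{Q}_n$, (11) holds.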




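Fix $n$, fix a $q \in \mathcal{Q}_n$, and write $m = \mu_n(\cdot, q)$. Assume that

$$\int_q |f_{n+1}(\omega)|\,m(d\omega) < \infty,$$

and that

$$m(\{\omega \in q : f_{n+1}(\omega) < f_n(q)\}) > 0, \qquad m(\{\omega \in q : f_{n+1}(\omega) > f_n(q)\}) > 0.$$

Write

$$E = \{\omega \in q : f_{n+1}(\omega) < f_n(q)\}, \qquad F = \{\omega \in q : f_{n+1}(\omega) > f_n(q)\},$$

$$a = m(E), \quad b = m(F), \quad c = \frac{1}{a}\int_E f_{n+1}(\omega)\,m(d\omega), \quad d = \frac{1}{b}\int_F f_{n+1}(\omega)\,m(d\omega).$$

Note that $c < f_n(q) < d$ and with $\alpha = \frac{d - f_n(q)}{d - c}$, $\beta = \frac{f_n(q) - c}{d - c}$, we have

$$\alpha + \beta = 1, \qquad \alpha c + \beta d = f_n(q).$$

Define $\nu'_n(\cdot,\cdot)$ as follows:

$$\frac{d\nu'_n(\cdot, q)}{dm} = \frac{\alpha}{a} 1_E + \frac{\beta}{b} 1_F.$$

Note here that $q$ ranges over $\mathcal{Q}_n$, so that $m$ will vary with it. Further, $a, \alpha, b, \beta, m = \mu_n(\cdot, q)$ are measurable functions of $q$, so that $\nu'_n(\cdot,\cdot)$ is a transition probability. If $m(\{\omega \in q : f_{n+1}(\omega) = f_n(q)\}) = 0$, then $\nu'_n(\cdot, q)$ and $m$ are equivalent, and

$$\int_q f_{n+1}(\omega)\,\nu'_n(d\omega, q) = \alpha c + \beta d = f_n(q).$$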


In any case, if we set $s = \{\omega \in q : f_{n+1}(\omega) = f_n(q)\}$, then $s \in \mathcal{Q}_{n+1}$, so that we can speak of the Dirac measure $\delta_{\{s\}}$ at $s$, and consider the measure $\nu_n(\cdot, q)$ defined by:

$$\nu_n(\cdot, q) = (1 - m(s))\,\nu'_n(\cdot, q) + m(s)\,\delta_{\{s\}}.$$

The measures $\nu_n(\cdot, q)$ and $m$ have the same null sets, $\nu_n(\cdot, q)$ is measurable in $q$, and

$$\int_q f_{n+1}(\omega)\,\nu_n(d\omega, q) = f_n(q).$$

We define inductively,

$$\nu_1 = \mu_1, \quad \nu_2 = \int_\Omega \nu_1(\cdot, q)\,\nu_1(dq), \quad \dots, \quad \nu_{n+1}(\cdot) = \int_\Omega \nu_n(\cdot, q)\,\nu_n(dq).$$

Then for each $n$, $\nu_n$ is a probability measure on $\mathcal{B}_n$, equivalent to $\mu_n$,

$$\nu_{n+1}|_{\mathcal{B}_n} = \nu_n, \qquad E_n(f_{n+1} \mid \mathcal{B}_n) = f_n,$$

where $E_n$ is the conditional expectation operator in $(\Omega, \mathcal{B}_{n+1}, \nu_{n+1})$.
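
The rescaling of $m$ on a single cell $q$ can be illustrated when $m$ has finitely many atoms. The Python sketch below is a hedged illustration, not part of the original argument: the function name `rescale_to_martingale` is hypothetical, `values[k]` and `m[k]` stand for the value of $f_{n+1}$ and the $m$-mass at the $k$-th atom of $q$, `target` stands for $f_n(q)$, and $m$ is assumed to be a probability vector.

```python
def rescale_to_martingale(values, m, target):
    """Masses of nu_n(., q): m is rescaled separately on E = {f_{n+1} < f_n(q)}
    and F = {f_{n+1} > f_n(q)}; the mass on {f_{n+1} = f_n(q)} is kept."""
    E = [k for k, v in enumerate(values) if v < target]
    F = [k for k, v in enumerate(values) if v > target]
    a = sum(m[k] for k in E)
    b = sum(m[k] for k in F)
    if a <= 0 or b <= 0:
        raise ValueError("both E and F need positive m-mass")
    c = sum(values[k] * m[k] for k in E) / a     # m-average of f_{n+1} over E
    d = sum(values[k] * m[k] for k in F) / b     # m-average of f_{n+1} over F
    alpha = (d - target) / (d - c)
    beta = (target - c) / (d - c)
    nu = []
    for k, v in enumerate(values):
        if v < target:
            nu.append((a + b) * alpha * m[k] / a)    # (1 - m(s)) * alpha / a
        elif v > target:
            nu.append((a + b) * beta * m[k] / b)     # (1 - m(s)) * beta / b
        else:
            nu.append(m[k])                          # mass m(s) stays on s
    return nu

values = [-2.0, -1.0, 0.0, 3.0]
m = [0.4, 0.3, 0.2, 0.1]
nu = rescale_to_martingale(values, m, target=0.0)
print(nu, sum(v * p for v, p in zip(values, nu)))
```

The output masses are strictly positive wherever $m$ is, so the new measure has the same null sets as $m$, and its mean of $f_{n+1}$ equals the target.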


If the filtration $(\mathcal{Q}_n)_{n=1}^{\infty}$ is complete, the naturally defined measure $\nu_\infty$ on $\mathcal{B}_\infty$ has a countably additive extension, say $\nu$, to all of $\mathcal{B}$. However, in general, $\nu$ need not be equivalent to $\mu$ (see Example 4.1 below). If there are positive constants $A$ and $B$ such that for all $n$, $A < \frac{d\nu_n}{d\mu_n} < B$, then clearly the $\nu_\infty$ defined on $\mathcal{B}_\infty$ has an extension to $\mathcal{B}$ which has the same null sets as $\mu$. We have proved:


Theorem 4.1.

(a) Let $(f_n)_{n=1}^{\infty}$ be a sequence of Borel measurable functions on $(\Omega, \mathcal{B}, \mu)$ and let $\mathcal{Q}_n, \mathcal{B}_n, \mu_n, \mu_n(\cdot,\cdot)$ be as above. If there is a probability measure $\nu$ equivalent to $\mu$ with respect to which $(f_n)_{n=1}^{\infty}$ is a martingale, then $(f_n)_{n=1}^{\infty}$ can be modified on a $\mu$-null set so that the resulting sequence of functions is a measure free martingale. Further, for each $n$, for almost every $q \in \mathcal{Q}_n$, the sets $\{\omega \in q : f_{n+1}(\omega) \le f_n(q)\}$, $\{\omega \in q : f_n(q) \le f_{n+1}(\omega)\}$ have positive $\mu_n(\cdot, q)$-measure.

(b) If for every $n$, and for almost every $q \in \mathcal{Q}_n$, the sets $\{\omega \in q : f_{n+1}(\omega) < f_n(q)\}$, $\{\omega \in q : f_n(q) < f_{n+1}(\omega)\}$ have positive $\mu_n(\cdot, q)$-measure, then for each $n$ we have a $\nu_n$, equivalent to $\mu_n$, such that $\nu_{n+1}|_{\mathcal{B}_n} = \nu_n$, $E_{n+1}(f_{n+1} \mid \mathcal{B}_n) = f_n$, where $E_{n+1}$ stands for the conditional expectation operator on $(\Omega, \mathcal{B}_{n+1}, \nu_{n+1})$ with respect to $\mathcal{B}_n$.


(c) Finally, if there are positive constants $A$ and $B$ such that for all $n$, $A < \frac{d\nu_n}{d\mu_n} < B$, then the naturally defined $\nu_\infty$ on $\mathcal{B}_\infty$ has an extension $\nu$ to $\mathcal{B}$ which has the same null sets as $\mu$, and $(f_n)_{n=1}^{\infty}$ is a martingale with respect to $\nu$.

The equivalent martingale measure $\nu$ in the above theorem is obtained by a rather naive modification of $\mu$. Indeed, $\mu_n(\cdot, q)$ is only rescaled over the sets $\{\omega \in q : f_{n+1}(\omega) < f_n(q)\}$, $\{\omega \in q : f_{n+1}(\omega) > f_n(q)\}$, so that if $\mu_n(\cdot, q)$ is not 'well distributed' on these sets, then this persists with $\nu_n(\cdot, q)$. This circumstance can be changed if we assume that for each $n$ and for each $q \in \mathcal{Q}_n$, $f_{n+1}$ is bounded on $q$, in addition to the requirement that the sets $\{\omega \in q : f_{n+1}(\omega) \le f_n(q)\}$, $\{\omega \in q : f_n(q) \le f_{n+1}(\omega)\}$ have positive $\mu_n(\cdot, q)$ measure. We know from entropy considerations of section 1.1 that there exists a unique $c_n = c_n(q)$ such that

$$\frac{\int_q f_{n+1}(\omega)\,e^{c_n(q) f_{n+1}(\omega)}\,\mu_n(d\omega, q)}{\int_q e^{c_n(q) f_{n+1}(\omega)}\,\mu_n(d\omega, q)} = f_n(q).$$

The function $q \to c_n(q)$ is $\mathcal{B}_n$ measurable. We set, for each $n$ and for each $q \in \mathcal{Q}_n$,

$$d\nu_n(\cdot, q) = \frac{e^{c_n(q) f_{n+1}(\cdot)}}{\int_q e^{c_n(q) f_{n+1}(\omega)}\,\mu_n(d\omega, q)}\,d\mu_n(\cdot, q),$$

$$\nu_1 = \mu_1, \qquad \nu_{n+1} = \int_\Omega \nu_n(\cdot, q)\,\nu_n(dq).$$

We change notation and write $\nu_n = \mathbf{B}_n$. The natural finitely additive measure $\mathbf{B}_\infty$ on $\mathcal{B}_\infty$ renders $(f_n)_{n=1}^{\infty}$ into a weak martingale. If we assume that there are positive constants $A, C$ such that for all $n$, $A < \frac{d\mathbf{B}_n}{d\mu_n} < C$, then $\mathbf{B}_\infty$ extends to a countably additive measure $\mathbf{B}$ on $\mathcal{B}$, $\mathbf{B}$ equivalent to $\mu$, and $(f_n)_{n=1}^{\infty}$ is a martingale with respect to $\mathbf{B}$.

We may summarise this as


Theorem 4.2. If for each $n$, for $\mu_n$ a.e. $q \in \mathcal{Q}_n$,

(i) $f_{n+1}$ is bounded on $q$,

(ii) $\mu_n(\{\omega \in q : f_{n+1}(\omega) \le f_n(q)\}, q) > 0$, $\mu_n(\{\omega \in q : f_{n+1}(\omega) \ge f_n(q)\}, q) > 0$,

then there exists a unique finitely additive measure $\mathbf{B}_\infty$ on $\mathcal{B}_\infty$ such that for each $n$,

(a) the restriction $\mathbf{B}_n$ of $\mathbf{B}_\infty$ to $\mathcal{B}_n$ is countably additive and equivalent to $\mu_n$,

(b) $\mathbf{B}_{n+1}$ maximizes the relative entropy $H(\frac{d\lambda_{n+1}}{d\mu_{n+1}}, \mu_{n+1})$ among all finitely additive probability measures $\lambda$ on $\mathcal{B}_\infty$ which render $(f_n)_{n=1}^{\infty}$ into a weak martingale and such that $\lambda_1 = \mu_1$, $\lambda_n$ equivalent to $\mu_n$ for all $n$,

(c) if there exist constants $A, B$ such that for each $n$, $A < \frac{d\mathbf{B}_n}{d\mu_n} < B$, then $\mathbf{B}_\infty$ extends to a countably additive measure $\mathbf{B}$ on $\mathcal{B}$, $\mathbf{B}$ and $\mu$ are equivalent, and $(f_n)_{n=1}^{\infty}$ is a martingale under $\mathbf{B}$.


Remarks. 1) The measure $\mathbf{B}$, however, need not maximize the relative entropy $H(\frac{d\lambda}{d\mu}, \mu)$ among all measures $\lambda$ on $\mathcal{B}$ equivalent to $\mu$ and under which $(f_n)_{n=1}^{\infty}$ is a martingale.

2) One may call $\mathbf{B}$ a Boltzmann measure equivalent to $\mu$ and the associated sequence $(f_n)_{n=1}^{\infty}$ a Boltzmann martingale.
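
On a cell $q$ where $\mu_n(\cdot, q)$ has finitely many atoms, the constant $c_n(q)$ can be found numerically, because the tilted mean is non-decreasing in $c$. The sketch below is a minimal illustration under assumptions that are not in the original text: a discrete measure with atoms `values` and masses `mu`, a target `target` standing for $f_n(q)$ lying strictly between the smallest and largest atom, plain bisection as the root finder, and the hypothetical names `tilt_parameter` and `boltzmann_weights`.

```python
import math

def _tilted(values, mu, c):
    """Unnormalised weights mu_k * exp(c * f_k), shifted to avoid overflow."""
    shift = max(c * v for v in values)
    return [p * math.exp(c * v - shift) for v, p in zip(values, mu)]

def tilt_parameter(values, mu, target, tol=1e-12):
    """Solve sum f e^{c f} mu / sum e^{c f} mu = target for c by bisection."""
    support = [v for v, p in zip(values, mu) if p > 0]
    if not (min(support) < target < max(support)):
        raise ValueError("target must lie strictly inside the range of f")

    def tilted_mean(c):
        w = _tilted(values, mu, c)
        return sum(v * x for v, x in zip(values, w)) / sum(w)

    lo, hi = -1.0, 1.0
    while tilted_mean(lo) > target:      # widen the bracket to the left
        lo *= 2.0
    while tilted_mean(hi) < target:      # widen the bracket to the right
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def boltzmann_weights(values, mu, target):
    """Masses of the exponentially tilted measure with mean equal to target."""
    w = _tilted(values, mu, tilt_parameter(values, mu, target))
    total = sum(w)
    return [x / total for x in w]

values = [-2.0, -1.0, 0.0, 3.0]
mu = [0.4, 0.3, 0.2, 0.1]
weights = boltzmann_weights(values, mu, target=0.0)
print(weights)
```

Unlike the two-block rescaling of Theorem 4.1, every atom is reweighted here, by a factor depending smoothly on the value of $f_{n+1}$.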

We now give a more general condition for the existence of a martingale measure equivalent to a given $\mu$ than the one given in Theorem 4.1 (c). Let $(f_n)_{n=1}^{\infty}$ be a sequence of measurable functions on $(\Omega, \mathcal{B}, \mu)$ for which there exists a measure $\nu_\infty$ on $\mathcal{B}_\infty$ such that (i) $(f_n)_{n=1}^{\infty}$ is a weak martingale with respect to $\nu_\infty$, (ii) for each $n$, $\mu_n$ and $\nu_n$ are equivalent, (iii) $\nu_1 = \mu_1$.


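Now the Radon-Nikodym derivative $\frac{d\nu_{n+1}}{d\mu_{n+1}}$ is computed as follows:

$$\frac{d\nu_{n+1}}{d\mu_{n+1}}(\omega) = \frac{d\nu_n(\cdot, q_n)}{d\mu_n(\cdot, q_n)}(\omega)\,\frac{d\nu_n}{d\mu_n}(\omega), \quad \omega \in q_n \in \mathcal{Q}_n.$$

So, on iteration, we have:

$$\frac{d\nu_{n+1}}{d\mu_{n+1}}(\omega) = \left(\prod_{i=1}^{n} \frac{d\nu_i(\cdot, q_i)}{d\mu_i(\cdot, q_i)}(\omega)\right)\frac{d\nu_1}{d\mu_1}(\omega), \quad \omega \in q_i \in \mathcal{Q}_i,\ i = 1, 2, \dots, n.$$

Since we have chosen $\nu_1 = \mu_1$, $\frac{d\nu_1}{d\mu_1}(\omega) = 1$ for all $\omega$, so that

$$\frac{d\nu_{n+1}}{d\mu_{n+1}}(\omega) = \prod_{i=1}^{n} \frac{d\nu_i(\cdot, q_i)}{d\mu_i(\cdot, q_i)}(\omega), \quad \omega \in q_i \in \mathcal{Q}_i,\ i = 1, 2, \dots, n.$$

Further,

$$E_\mu\left(\frac{d\nu_{n+1}}{d\mu_{n+1}} \,\Big|\, \mathcal{B}_n\right)(\omega) = \frac{d\nu_n}{d\mu_n}(\omega),$$

where $E_\mu$ denotes the conditional expectation operator with respect to $\mu$. Indeed for any set $A \in \mathcal{B}_n$,

$$\int_A \frac{d\nu_{n+1}}{d\mu_{n+1}}(\omega)\,\mu(d\omega) = \int_A \frac{d\nu_{n+1}}{d\mu_{n+1}}(\omega)\,\mu_{n+1}(d\omega)$$
$$= \int_A \frac{d\nu_n}{d\mu_n}(\omega)\,\frac{d\nu_n(\cdot, q_n)}{d\mu_n(\cdot, q_n)}(\omega)\,\mu_{n+1}(d\omega)$$
$$= \int_A \frac{d\nu_n}{d\mu_n}(q)\left(\int_q \frac{d\nu_n(\cdot, q)}{d\mu_n(\cdot, q)}(\omega)\,\mu_n(d\omega, q)\right)\mu_n(dq)$$
$$= \int_A \frac{d\nu_n}{d\mu_n}(\omega)\,\mu_n(d\omega).$$

The sequence of functions $(g_n = \frac{d\nu_n}{d\mu_n})_{n=1}^{\infty}$ is therefore a martingale of non-negative functions on the probability space $(\Omega, \mathcal{B}, \mu)$, so converges $\mu$ a.e. to a function $g$. If $\int_\Omega g(\omega)\,\mu(d\omega) = 1$, or if the sequence $(g_n)_{n=1}^{\infty}$ is uniformly integrable, then, by the martingale convergence theorem (see [4], p. 319), for each $n$, $g_n = E_\mu(g \mid \mathcal{B}_n)$, equivalently, for each $n$, $g\,d\mu|_{\mathcal{B}_n} = d\nu_n$, and so $\nu_\infty$ is countably additive and extends to a measure $\nu$ on $\mathcal{B}$, with $\frac{d\nu}{d\mu} = g$. Further $(f_n)_{n=1}^{\infty}$ is a martingale with respect to it. If, in addition, $g > 0$ a.e. $\mu$, then $\nu$ is equivalent to $\mu$ and $(f_n)_{n=1}^{\infty}$ is a martingale with respect to it.

We have proved: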


Theorem 4.3.

(a) The sequence $(g_n = \frac{d\nu_n}{d\mu_n})_{n=1}^{\infty}$ of Radon-Nikodym derivatives is a martingale of non-negative functions and converges $\mu$ a.e. to a function $g$.

(b) If $\int_\Omega g(\omega)\,\mu(d\omega) = 1$ or if the $g_n$'s are uniformly integrable with respect to $\mu$, then $(f_n)_{n=1}^{\infty}$ is a martingale with respect to $\nu$ given by $d\nu = g\,d\mu$. If in addition $g > 0$ a.e. $\mu$, $\nu$ is equivalent to $\mu$.

(c) If $\sum_{n=1}^{\infty} \left|1 - \frac{g_n}{g_{n+1}}\right| < \infty$ a.e. $\mu$, then $g > 0$ a.e. $\mu$.


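Write

$$H\left(\frac{d\nu_{n+1}}{d\mu_{n+1}}, \mu_{n+1} \,\Big|\, \mathcal{Q}_n\right) = \int_\Omega H\left(\frac{d\nu_n(\cdot, q)}{d\mu_n(\cdot, q)}, \mu_n(\cdot, q)\right)\nu_n(dq).$$

We know from formula (7) of section 1.2.3 that

$$H\left(\frac{d\nu_{n+1}}{d\mu_{n+1}}, \mu_{n+1}\right) = H\left(\frac{d\nu_{n+1}}{d\mu_{n+1}}, \mu_{n+1} \,\Big|\, \mathcal{Q}_n\right) + H\left(\frac{d\nu_n}{d\mu_n}, \mu_n\right).$$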


Entropy and Martingale 27 


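Iterating we get

$$H\left(\frac{d\nu_{n+1}}{d\mu_{n+1}}, \mu_{n+1}\right) = \sum_{i=1}^{n+1} H\left(\frac{d\nu_i}{d\mu_i}, \mu_i \,\Big|\, \mathcal{Q}_{i-1}\right).$$

(Here, when $i = 1$, $\mathcal{Q}_{i-1} = \mathcal{Q}_0$, which we take to be the trivial partition $\{\emptyset, \Omega\}$.)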


Since $(g_n = \frac{d\nu_n}{d\mu_n})_{n=1}^{\infty}$ is a martingale with respect to $\mu$, the sequence $(g_n \log^+ g_n)_{n=1}^{\infty}$ is a submartingale provided, for each $n$, $E_\mu(g_n \log^+ g_n)$ is finite ([4], p. 296). We assume that this is the case. Then

$$E_\mu(g_n \log^+ g_n) \le E_\mu(g_{n+1} \log^+ g_{n+1}), \quad n = 1, 2, \dots,$$

so that $\lim_{n \to \infty} E_\mu(g_n \log^+ g_n)$ exists, which may be finite or infinite. We assume that this limit is finite, say $c$. Since $(g_n \log^+ g_n)_{n=1}^{\infty}$ is a submartingale of non-negative functions it has a limit, which is indeed $g \log^+ g$, since $(g_n)_{n=1}^{\infty}$ has limit $g$, $\mu$ a.e. Moreover by Fatou's lemma, $E_\mu(g \log^+ g) \le c$. Assume that the sequence $(g_n)_{n=1}^{\infty}$ is uniformly integrable so that this sequence together with $g$ forms a martingale. From martingale theory ([4], p. 296) the sequence $(g_n \log^+ g_n)_{n=1}^{\infty}$ together with the function $g \log^+ g$ is a submartingale of non-negative functions, so, again from martingale theory ([4], p. 325) we conclude that

$$\lim_{n \to \infty} E_\mu(g_n \log^+ g_n) = E_\mu(g \log^+ g).$$

Since $g_n \log^- g_n$ remains bounded independent of $n$ (which is the case at $\omega$ where $g_n(\omega) < 1$), we also have

$$\lim_{n \to \infty} \int_\Omega g_n \log^- g_n(\omega)\,\mu(d\omega) = \int_\Omega g \log^- g(\omega)\,\mu(d\omega).$$

Thus we have

$$\lim_{n \to \infty} H\left(\frac{d\nu_n}{d\mu_n}, \mu_n\right) = H\left(\frac{d\nu}{d\mu}, \mu\right).$$

We have proved:
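Theorem 4.4.

(a) Assume that the martingale $(g_n = \frac{d\nu_n}{d\mu_n})_{n=1}^{\infty}$ is uniformly integrable and that $\int_\Omega g_n \log^+ g_n\,\mu_n(d\omega) < c$, for some real number $c$. Then $\lim_{n \to \infty} H(\frac{d\nu_n}{d\mu_n}, \mu_n)$ exists and we have:

$$\lim_{n \to \infty} H\left(\frac{d\nu_n}{d\mu_n}, \mu_n\right) = \sum_{i=1}^{\infty} H\left(\frac{d\nu_i}{d\mu_i}, \mu_i \,\Big|\, \mathcal{Q}_{i-1}\right) = H\left(\frac{d\nu}{d\mu}, \mu\right).$$

(b) In addition to the hypothesis and notations of Theorem 4.2, assume that the martingale $(\frac{d\mathbf{B}_n}{d\mu_n})_{n=1}^{\infty}$ is uniformly integrable and that the sequence of relative entropies $(H(\frac{d\mathbf{B}_n}{d\mu_n}, \mu_n))_{n=1}^{\infty}$ remains bounded. Then $\lim_{n \to \infty} H(\frac{d\mathbf{B}_n}{d\mu_n}, \mu_n)$ exists and we have:

$$\lim_{n \to \infty} H\left(\frac{d\mathbf{B}_n}{d\mu_n}, \mu_n\right) = \sum_{n=1}^{\infty} H\left(\frac{d\mathbf{B}_n}{d\mu_n}, \mu_n \,\Big|\, \mathcal{Q}_{n-1}\right) = H\left(\frac{d\mathbf{B}}{d\mu}, \mu\right).$$

Among all $\nu$ absolutely continuous with respect to $\mu$ under which $(f_n)_{n=1}^{\infty}$ is a martingale and $\nu_1 = \mu_1$, $\mathbf{B}$ is the unique one which maximizes, for each $n$, the relative entropy $H(\frac{d\nu_n}{d\mu_n}, \mu_n)$.
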
Example 4.1. Consider $\mathbb{R}^2$ together with the measure $\mu = \sigma \times \sigma$ where $\sigma$ is the normal distribution with mean zero and variance one. Let $\mathcal{Q}_1$ be the partition $\{\{x\} \times \mathbb{R}, x \in \mathbb{R}\}$. Let $f_i$, $i = 1, 2$, be the co-ordinate maps. The partition of $\mathbb{R}^2$ given by $f_1$ is $\mathcal{Q}_1$. The distribution $\nu$ on $\mathbb{R}^2$ equivalent to $\mu$, satisfying $E_\nu(f_2 \mid f_1) = E_\nu(f_2 \mid \mathcal{Q}_1) = f_1$, and maximizing the relative entropy with respect to $\mu$ is the bivariate distribution of $(f_1, f_1 + f_2)$, where $f_1, f_2$ are independent with normal distribution of mean zero and variance one.


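More generally, let $\mathbb{R}^n$ be given the measure $\mu = \sigma^n$, the $n$-fold product of $\sigma$. Let $f_1, f_2, \dots, f_n$ be the co-ordinate random variables. Then

$$\mathcal{Q}_j = \{\{(\omega_1, \omega_2, \dots, \omega_j)\} \times \mathbb{R}^{n-j} : (\omega_1, \omega_2, \dots, \omega_j) \in \mathbb{R}^j\}$$

is the partition of $\mathbb{R}^n$ given by the functions $f_1, f_2, \dots, f_j$. Let $\nu_n$ be the measure induced on $\mathbb{R}^n$ by the vector random variable $(f_1, f_1 + f_2, \dots, f_1 + f_2 + \cdots + f_n)$ where $(f_1, f_2, \dots, f_n)$ has distribution $\mu_n = \sigma^n$. Then, among all probability measures $\lambda$ on $\mathbb{R}^n$ equivalent to $\mu$ and satisfying $E_\lambda(f_{i+1} \mid f_i) = f_i$, $1 \le i \le n-1$, $\nu_n$ is the unique one which simultaneously maximizes the relative entropies

$$-\int_{\mathbb{R}^i} \frac{d\lambda_i}{d\mu_i}(\omega) \log\left(\frac{d\lambda_i}{d\mu_i}(\omega)\right)\mu_i(d\omega),$$

$1 \le i \le n$, where $\mu_i, \lambda_i$ are respectively the measures $\mu$ and $\lambda$ restricted to the $\sigma$-algebra $\mathcal{B}_i$ generated by $f_1, f_2, \dots, f_i$.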

Finally, let (i) $\Omega = \mathbb{R}^{\mathbb{N}}$, where $\mathbb{N}$ is the set of natural numbers, (ii) $\mu$ = countable product of $\sigma$ with itself, (iii) for each $n$, $f_n$ = projection on the $n$th co-ordinate space. The partition $\mathcal{Q}_n$ of $\Omega$ generated by $f_1, f_2, \dots, f_n$ is the collection

$$\{\{(\omega_1, \omega_2, \dots, \omega_n)\} \times \mathbb{R}^{\mathbb{N} \setminus \{1, 2, \dots, n\}} : (\omega_1, \omega_2, \dots, \omega_n) \in \mathbb{R}^n\}.$$

For each $n$, let $\nu_n$ denote the measure on $\mathcal{B}_n$ induced by the martingale $(f_1, f_1 + f_2, \dots, \sum_{i=1}^{n} f_i)$, where $f_1, f_2, \dots, f_n$ are independent random variables, each with distribution $\sigma$. Let $\nu_\infty$ be the measure on the algebra $\bigcup_{n=1}^{\infty} \mathcal{B}_n$ whose restriction to each $\mathcal{B}_n$ is $\nu_n$. Then among all probability measures $\lambda_\infty$ on $\mathcal{B}_\infty$ which satisfy (a) for each $n$, $\mu_n$ and $\lambda_n$ are equivalent, (b) for each $n$, $E_{\lambda_{n+1}}(f_{n+1} \mid f_n) = f_n$, (c) $\lambda_1 = \sigma$, the measure $\nu_\infty$ is the unique one which maximizes simultaneously the relative entropies $H(\frac{d\lambda_n}{d\mu_n}, \mu_n)$, $n = 1, 2, \dots$. The extension $\nu$ of $\nu_\infty$ to the Borel $\sigma$-algebra of $\mathbb{R}^{\mathbb{N}}$ is, however, singular to $\mu$.
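
In this example the density $g_n = \frac{d\nu_n}{d\mu_n}$ is explicit: since $\nu_n$ is the law of the partial sums of independent standard normal variables, $\log g_n(x) = \sum_{i=2}^{n} (x_i x_{i-1} - x_{i-1}^2/2)$, whose mean under $\mu$ is $-(n-1)/2$. The Python sketch below checks this mean by Monte Carlo; it is an illustration only, with hypothetical function names and a fixed seed. The drift of $\log g_n$ towards $-\infty$ under $\mu$ is consistent with the singularity of $\nu$ and $\mu$.

```python
import random

def log_density_ratio(x):
    """log g_n(x) for g_n = d nu_n / d mu_n at the point x = (x_1, ..., x_n)."""
    return sum(x[i] * x[i - 1] - 0.5 * x[i - 1] ** 2 for i in range(1, len(x)))

def mean_log_ratio_under_mu(n, samples=20000, seed=0):
    """Monte Carlo estimate of E_mu[log g_n], with x_1, ..., x_n i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += log_density_ratio(x)
    return total / samples

for n in (2, 5, 10):
    print(n, mean_log_ratio_under_mu(n), -(n - 1) / 2)
```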
Chapter 2


Marginal Quantiles: Asymptotics for Functions of 
Order Statistics 


G. Jogesh Babu 


Department of Statistics, 
826 Joab L. Thomas Building, 
The Pennsylvania State University, 
University Park, PA 16802-2111, USA* 


Methods for quantile estimation based on massive streaming data are 
reviewed. Marginal quantiles help in the exploration of massive multi- 
variate data. Asymptotic properties of the joint distribution of marginal 
sample quantiles of multivariate data are also reviewed. The results in- 
clude weak convergence to Gaussian random elements. Asymptotics for 
the mean of functions of order statistics are also presented. Application 
of the latter result to regression analysis under partial or complete loss 
of association among the multivariate data is described. 


2.1. Introduction 


Data depth provides an ordering of all points from the center outward. Contours of depth are often used to reveal the shape and structure of a multivariate data set. The depth of a point $x$ in a one-dimensional data set $\{x_1, x_2, \dots, x_n\}$ can be defined as the minimum of the numbers of data points on either side of $x$ (cf. [10]).
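
A one-line computation makes this definition concrete. The sketch below is illustrative and rests on an assumption not fixed by the text: points equal to $x$ are counted on both sides. The function name `depth_1d` is hypothetical.

```python
def depth_1d(x, data):
    """Minimum of the numbers of data points at or below x and at or above x."""
    left = sum(1 for v in data if v <= x)
    right = sum(1 for v in data if v >= x)
    return min(left, right)

data = [1, 2, 3, 4, 5]
print([depth_1d(x, data) for x in data])
```

With this convention the median of the sample has the largest depth and points outside the range of the data have depth zero.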

Several multidimensional depth measures $D_n(x; x_1, \dots, x_n)$ for $x \in \mathbb{R}^k$ that satisfy certain mathematical conditions have been considered by many authors. If the data is from a spherical or elliptic distribution, the depth contours are generally required to converge to spherical or elliptic shapes. In this paper we concentrate on marginal quantiles. They help in describing percentile contours, which lead to a description of the densities and the multivariate distributions.

This approach is useful in quickly exploring massive datasets that are becoming more and more common in diverse fields such as Internet traffic and large sky surveys. For example, several ongoing sky surveys such as the Two Micron All Sky Survey and the Sloan Digital Sky Survey are providing maps of the sky at infrared and optical wavelengths, respectively, generating data sets measured in the tens of Terabytes. These surveys are creating catalogs of objects (stars, galaxies, quasars, etc.) numbering in billions, with up to a hundred measured numbers for each object. Yet, this is just a foretaste of the much larger datasets to come from surveys such as the Large Synoptic Survey Telescope. This great opportunity comes with a commensurate technological challenge: how to optimally store, manage, combine, analyze and explore these vast amounts of complex information, and to do it quickly and efficiently? It is difficult even to compute a median of massive one dimensional data. As the multidimensional case is much more complex, marginal quantiles can be used to study the structure.

In this review article we start with a description of estimation methods for quantiles and density for massive streaming data. We then describe asymptotic properties of the joint distribution of marginal sample quantiles of multivariate data. We conclude with recent work on asymptotics for the mean of functions of order statistics and their applications to regression analysis under partial or complete loss of association among the multivariate data.


2.1.1. Streaming Data 


As described above, massive streaming datasets are becoming more and more common. The data is in the form of a continuous stream with no fixed size. Finding trends in such massive data is very important. One cannot wait till all the data is in and stored for retrieval for statistical analysis. Even computing the median of a billion stored data points is not feasible. In this case one can think of the data as streaming data and use low storage methods to continually update the estimate of the median and other quantiles ([2] and [7]). Simultaneous estimation of multiple quantiles would aid in density estimation.

Consider the problem of estimating the p-th quantile based on a very large dataset with n points, of which a fixed number, say m, can be placed into memory for sorting and ranking. Initially, each of these points is given a weight and a score based on p. Now a new point from the dataset is put in the array and all the points in the existing array above it have their ranks increased by 1. The weights and scores are updated for these m + 1 points. The point with the largest score is then dropped from the array, and the process is repeated. Once all the data points have been run through the procedure, the data point with rank closest to np is taken as an estimate of the p-th quantile. See [7] for the details.
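
The weights and scores used in [7] are not reproduced here. As a simpler stand-in that shares the fixed-memory, single-pass setting, the following Python sketch keeps a uniform random sample of m points by reservoir sampling and reports the sample quantile of that reservoir. It is a different technique from the one described above, offered only as an illustration; the function name `reservoir_quantile` and the fixed seed are assumptions.

```python
import random

def reservoir_quantile(stream, p, m, seed=0):
    """Estimate the p-th quantile of a stream while storing at most m points."""
    rng = random.Random(seed)
    reservoir = []
    for count, x in enumerate(stream, start=1):
        if count <= m:
            reservoir.append(x)
        else:
            j = rng.randrange(count)      # uniform in {0, ..., count - 1}
            if j < m:
                reservoir[j] = x          # keep x with probability m / count
    if not reservoir:
        raise ValueError("empty stream")
    reservoir.sort()
    k = min(len(reservoir) - 1, int(p * len(reservoir)))
    return reservoir[k]

print(reservoir_quantile(range(100000), 0.5, 1000))
```

The accuracy of this stand-in is that of a sample quantile based on m points, which is in general lower than what a method tracking ranks over the whole stream can achieve.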

Methods for estimating several quantiles simultaneously are needed for density estimation when the data is streaming. The method developed by [8] uses estimated ranks, assigned weights, and a scoring function that determines the most attractive candidate data points for estimates of the quantiles. The method uses a small fixed storage and its computation time is O(n). Simulation studies show that the estimates are as accurate as the sample quantiles.

While the estimated quantiles are useful and informative on their own, 
it is often more useful to have information about the density as well. The 
probability density function can give a more intuitive picture of such char- 
acteristics as the skewness of the distribution or the number of modes. Any 
of the many standard curve fitting methods can now be employed to obtain 
an estimate of the cumulative distribution function. 

The procedure is also useful in approximating the unknown underlying cumulative distribution function: one fits a cubic spline through the simultaneous quantile estimates, and the derivative of this spline fit provides an estimate of the probability density function.

The concept of convex hull peeling is useful in developing procedures for the median in 2 or more dimensions. The convex hull of a dataset is the minimal convex set of points that contains the entire dataset. The convex hull based multivariate median is obtained by successively peeling outer layers until the dataset cannot be peeled any further. The centroid of the resulting set is taken as the multivariate median. Similarly, a multivariate interquartile range is obtained by successively peeling convex hull surfaces until approximately 50% of the data is contained within a hull. This hull is then taken as the multivariate interquartile range. See [5] for a nice overview. This procedure requires assumptions on the shape of the density. To avoid this one could use the joint distribution of marginal quantiles to find the multidimensional structure.
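
The peeling procedure can be sketched in two dimensions as follows. This is an illustrative implementation under assumptions that the text does not fix: hull vertices are found with Andrew's monotone chain, points lying on a hull edge but not at a vertex are left for the next layer, peeling stops at the last non-empty layer, and the function names are hypothetical.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_peeling_median(points):
    """Peel hull layers until none remain inside; return the centroid of the last layer."""
    pts = list(set(points))
    if not pts:
        raise ValueError("no points")
    while True:
        hull = set(convex_hull(pts))
        inner = [p for p in pts if p not in hull]
        if not inner:
            break
        pts = inner
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

outer = [(-2, -2), (-2, 2), (2, -2), (2, 2)]
inner = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
print(hull_peeling_median(outer + inner + [(0, 0)]))
```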


2.2. Marginal Quantiles 


Babu & Rao (cf. [3]) derived asymptotic results on marginal quantiles and quantile processes. They also developed tests of significance for population medians based on the joint distribution of marginal sample quantiles. The joint asymptotic distribution of the sample medians was developed by [9]; see also [6], where the existence of the multivariate density is assumed. On the other hand Babu & Rao work with a much weaker assumption, the existence of densities of univariate marginals alone.


2.2.1. Joint Distribution of Marginal Quantiles 


Let $F$ denote a $k$-dimensional distribution function and let $F_j$ denote the
$j$-th marginal distribution function. The quantile functions of the marginals
are defined by:


$$F_j^{-1}(u) = \inf\{x : F_j(x) \ge u\}, \quad \text{for } 0 < u < 1.$$


Thus $F_j^{-1}(u)$ is the $u$-th quantile of the $j$-th marginal.

Let $X_1, \ldots, X_n$ be independent random vectors with common distribution
$F$, where $X_i = (X_{i1}, \ldots, X_{ik})$. Hence $F_j$ is the distribution of $X_{ij}$. To
get the joint distribution of sample quantiles, let $0 < q_1, \ldots, q_k < 1$. Let
$\delta_j$ denote the density of $F_j$ at $F_j^{-1}(q_j)$ and let $\hat\theta_j$ denote the $q_j$-th
sample quantile based on the $j$-th coordinates $X_{1j}, \ldots, X_{nj}$ of the sample. [3]
obtained the following theorem.


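Theorem 2.1. Let $F_j$ be twice continuously differentiable in a neighborhood
of $F_j^{-1}(q_j)$ and $\delta_j > 0$. Then the asymptotic distribution of

$$\sqrt{n}\,\big(\hat\theta_1 - F_1^{-1}(q_1), \ldots, \hat\theta_k - F_k^{-1}(q_k)\big)$$

is $k$-variate Gaussian with mean vector zero and variance-covariance
matrix $\Sigma$ given by

$$\Sigma = \begin{pmatrix}
q_1(1-q_1)\delta_1^{-2} & \sigma_{12} & \cdots & \sigma_{1k} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{k1} & \sigma_{k2} & \cdots & q_k(1-q_k)\delta_k^{-2}
\end{pmatrix},$$

where for $i \ne j$, $\sigma_{ij} = \big(F_{ij}(F_i^{-1}(q_i), F_j^{-1}(q_j)) - q_i q_j\big) / (\delta_i \delta_j)$,
and $F_{ij}$ denotes the joint distribution function of $(X_{1i}, X_{1j})$.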


The proof uses Bahadur’s representation of the sample quantiles (see [4]).


In practice $\sigma_{ij}$ can be directly estimated using the bootstrap method,

$$\hat\sigma_{ij} = E^*\big(n(\theta_i^* - \hat\theta_i)(\theta_j^* - \hat\theta_j)\big),$$

where $\theta_j^*$ denotes the bootstrapped marginal sample quantile and $E^*$ denotes
the expectation under the bootstrap distribution function. An advantage
of the bootstrap procedure is that it avoids density estimation
altogether.
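
As an illustration of the bootstrap estimate, the following sketch approximates $E^*$ by a Monte Carlo average over resampled data sets. The function names, the number of bootstrap replications `n_boot`, and the simulated bivariate data are all assumptions made for this example; the sample quantile is taken as the smallest order statistic whose empirical distribution function is at least $q$.

```python
import math
import random


def sample_quantile(values, q):
    """Smallest order statistic whose empirical d.f. is at least q."""
    ordered = sorted(values)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def marginal_quantiles(rows, qs):
    """q_j-th sample quantile of the j-th coordinate, for each j."""
    return [sample_quantile([row[j] for row in rows], q) for j, q in enumerate(qs)]


def bootstrap_quantile_cov(rows, qs, n_boot=400, seed=0):
    """Monte Carlo version of E*( n (theta*_i - theta^_i)(theta*_j - theta^_j) )."""
    rng = random.Random(seed)
    n, k = len(rows), len(qs)
    theta_hat = marginal_quantiles(rows, qs)
    cov = [[0.0] * k for _ in range(k)]
    for _ in range(n_boot):
        # Resample whole vectors so the dependence between coordinates is kept.
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        theta_star = marginal_quantiles(resample, qs)
        dev = [theta_star[j] - theta_hat[j] for j in range(k)]
        for i in range(k):
            for j in range(k):
                cov[i][j] += n * (dev[i] * dev[j]) / n_boot
    return cov


# Hypothetical data: two positively dependent coordinates sharing a common factor.
rng = random.Random(1)
data = []
for _ in range(200):
    z = rng.gauss(0, 1)
    data.append((z + rng.gauss(0, 1), z + rng.gauss(0, 1)))

sigma_hat = bootstrap_quantile_cov(data, qs=(0.5, 0.5))  # marginal medians
```

No density estimate appears anywhere in the computation, which is the practical advantage noted above.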


2.2.2. Weak Convergence of Quantile Process


We now describe the weak limits of the entire marginal quantile processes.
For $(q_1, \ldots, q_k) \in (0,1)^k$, define the sample quantile process,


$$Z_n(q_1, \ldots, q_k) = \sqrt{n}\,\big(\delta_1(\hat\theta_1 - F_1^{-1}(q_1)), \ldots, \delta_k(\hat\theta_k - F_k^{-1}(q_k))\big).$$


The following theorem is from Section 4 of [3].


Theorem 2.2. Suppose for $j = 1, \ldots, k$, the marginal d.f. $F_j$ is twice
differentiable on $(a_j, b_j)$, where

$$-\infty \le a_j = \sup\{x : F_j(x) = 0\}, \qquad +\infty \ge b_j = \inf\{x : F_j(x) = 1\}.$$

Further suppose that the first two derivatives $F_j'$ and $F_j''$ of $F_j$ satisfy the
conditions

$$F_j' \ne 0 \ \text{ on } (a_j, b_j),$$

$$\max_{1 \le j \le k}\ \sup_{a_j < x < b_j} F_j(x)(1 - F_j(x)) \frac{|F_j''(x)|}{(F_j'(x))^2} < \infty,$$

and $F_j'$ is non-decreasing (non-increasing) on an interval to the right of
$a_j$ (to the left of $b_j$). Then $Z_n(q_1, \ldots, q_k)$ converges weakly to a Gaussian
random element $(W_1, \ldots, W_k)$ on $C[0,1]^k$.


Thus, each marginal of $Z_n$ converges weakly to a Brownian bridge. The
covariance of the limiting Gaussian random element is given by


$$E(W_i(t)W_j(s)) = P\big(F_i(X_{1i}) \le t,\ F_j(X_{1j}) \le s\big) - ts.$$


2.3. Regression under Lost Association 


[11] developed a method of estimation of linear regression coefficients when
the association among the paired data is partially or completely lost. He
considered the simple linear regression problem


$$Y_i = \alpha + \beta U_i + \epsilon_i,$$


where $U_i$ are independent identically distributed (i.i.d.) with mean $\mu$ and
standard deviation $\sigma_U$, the residual errors $\epsilon_i$ are i.i.d. with mean zero and
standard deviation $\sigma_\epsilon$. Further, $\{U_i\}$ and $\{\epsilon_i\}$ are assumed to be independent
sequences. If $\Pi_n$ denotes the set of all permutations of $\{1, \ldots, n\}$,
then it is natural to find estimators $\hat\alpha$, $\hat\beta$ of $\alpha$, $\beta$ that minimize


$$h(\alpha, \beta) = \min_{\pi \in \Pi_n} \sum_{i=1}^{n} (Y_{\pi(i)} - \alpha - \beta U_i)^2.$$


[11] has shown that the permutation that minimizes $h$ is free from $\alpha$, $|\beta|$.
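
The reduction from $n!$ permutations to a sorted pairing can be checked numerically on a small sample. The sketch below is an illustration based on the rearrangement inequality, not the construction in [11]: for fixed $(\alpha, \beta)$ the cross term $\beta \sum_i U_i Y_{\pi(i)}$ is the only part of the sum of squares that depends on $\pi$, so the minimum is attained by pairing the ordered $U$'s with the ordered $Y$'s when $\beta > 0$, and with the reverse-ordered $Y$'s when $\beta < 0$. The function names and the data are hypothetical.

```python
from itertools import permutations


def loss(u, y_perm, alpha, beta):
    """Sum of squared residuals for one particular pairing of y with u."""
    return sum((yi - alpha - beta * ui) ** 2 for ui, yi in zip(u, y_perm))


def brute_force_min(u, y, alpha, beta):
    """Minimum over all n! pairings (feasible only for very small n)."""
    return min(loss(u, p, alpha, beta) for p in permutations(y))


def sorted_pairing_min(u, y, alpha, beta):
    """Pair ranks of u with ranks of y (reversed ranks of y if beta < 0)."""
    order = sorted(range(len(u)), key=lambda i: u[i])
    y_sorted = sorted(y, reverse=(beta < 0))
    y_perm = [0.0] * len(u)
    for rank, i in enumerate(order):
        y_perm[i] = y_sorted[rank]
    return loss(u, y_perm, alpha, beta)


u = [0.3, -1.2, 2.5, 0.9, -0.4, 1.7]
y = [1.1, 4.0, -0.5, 2.2, 0.1, 3.3]
checks = [
    abs(brute_force_min(u, y, a, b) - sorted_pairing_min(u, y, a, b))
    for a, b in [(0.5, 2.0), (0.5, -2.0), (-1.0, 0.3)]
]
```

All entries of `checks` are zero up to rounding, so only the two monotone pairings ever need to be examined.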

The main difficulty is the computational complexity. As there are $n!$
permutations, conceivably it requires that many computations. [11] has
shown that $\hat\beta$ depends only on two permutations. In particular, he has
shown that


$$\frac{1}{n}\sum_{i=1}^{n} U_{(i)} Y_{(i)} \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^{n} U_{(i)} Y_{(n-i+1)}$$

appear in the definition of $\hat\beta$, where $U_{(i)}$ and $Y_{(i)}$ denote the $i$-th order
statistics of the two samples. Hence the results on their limits are needed
to obtain the asymptotics for $\hat\beta$. This would aid in the estimation of the
bias of $\hat\beta$. Further, testing of hypotheses or obtaining confidence intervals
for $\beta$ requires the limiting distribution of


$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} U_{(i)} Y_{(i)}.$$


See Example 2.2 in the last section.

In the next section we present some work in progress on the strong law
of large numbers and central limit theorems for means of general functions
of order statistics. These results would aid in establishing


$$\hat\beta \to \beta_0 = \operatorname{sign}(\beta)\sqrt{\beta^2 + \sigma_\epsilon^2 \sigma_U^{-2}}.$$


2.4. Mean of Functions of Order Statistics 


This section is based on the current research by [1]. We present some recent
results on the strong law of large numbers and the central limit theorem for the
means of functions of order statistics. Let $X_i$ and $X_{ij}$ be as in Section
2.2.1. Let $X_{n:i}^{(j)}$ denote the $i$-th order statistic of $\{X_{1j}, \ldots, X_{nj}\}$. Suppose
$\phi$ is a measurable function on $\mathbb{R}^k$ and the function $\gamma$ defined by


$$\gamma(u) = \phi(F_1^{-1}(u), \ldots, F_k^{-1}(u)), \quad 0 < u < 1,$$


is integrable on $(0, 1)$.


Theorem 2.3. Suppose $F_j$ are continuous, $\phi(F_1^{-1}(u_1), \ldots, F_k^{-1}(u_k))$ is
continuous in the neighborhood of the diagonal $u_1 = u, \ldots, u_k = u$, $0 <
u < 1$, and for some $A$ and $0 < c_0 < 1/2$,


$$|\phi(F_1^{-1}(u_1), \ldots, F_k^{-1}(u_k))| \le A\Big(1 + \sum_{j=1}^{k} |\gamma(u_j)|\Big)$$

whenever $(u_1, \ldots, u_k) \in (0, c_0)^k \cup (1 - c_0, 1)^k$. Then

$$\frac{1}{n}\sum_{i=1}^{n} \phi\big(X_{n:i}^{(1)}, \ldots, X_{n:i}^{(k)}\big) \xrightarrow{\text{a.e.}} \int_0^1 \gamma(y)\,dy.$$

For example, in the two dimensional case,

$$\phi(F_1^{-1}(u), F_2^{-1}(v)) = \min(u, v)^{-a}(1 - \max(u, v))^{-a}$$

with $0 < a < 1$ satisfies the conditions of Theorem 2.3.


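To establish asymptotic normality, we require

$$\lim_{u \to 0} \sqrt{u}\,\big(|\gamma(u)| + |\gamma(1 - u)|\big) = 0,$$

square integrability of the partial derivatives $\psi_j$,

$$\psi_j(u) = \frac{\partial \phi(F_1^{-1}(u_1), \ldots, F_k^{-1}(u_k))}{\partial u_j}\bigg|_{(u_1, \ldots, u_k) = (u, \ldots, u)},$$

and some smoothness conditions on $\psi_j$ and $\phi(F_1^{-1}(u_1), \ldots, F_k^{-1}(u_k))$, in
addition to the conditions of Theorem 2.3.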


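Theorem 2.4. Assume for any pair $(1 \le j \ne r \le k)$, the joint distribution
$F_{j,r}$ of $(X_{ij}, X_{ir})$ is continuous. Under regularity assumptions that include
the conditions mentioned above, we have


$$\sqrt{n}\Big(\frac{1}{n}\sum_{i=1}^{n} \phi\big(X_{n:i}^{(1)}, \ldots, X_{n:i}^{(k)}\big) - \int_0^1 \gamma(y)\,dy\Big) \xrightarrow{\text{dist}} N(0, \sigma^2),$$


where


$$\sigma^2 = \sum_{1 \le j \ne r \le k} \int_0^1\!\!\int_0^1 \big(F_{j,r}(F_j^{-1}(x), F_r^{-1}(y)) - xy\big)\,\psi_j(x)\psi_r(y)\,dx\,dy
 + 2\sum_{j=1}^{k} \int_0^1\!\!\int_0^y x(1 - y)\,\psi_j(x)\psi_j(y)\,dx\,dy.$$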


Details are in [1].


2.5. Examples


The above results are illustrated with two examples.


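Example 2.1. Let $X_i$ be as in Section 2.2.1. Let the marginals $X_{ij}$ be
uniformly distributed and let $\phi(u_1, \ldots, u_k) = u_1^{a_1} \cdots u_k^{a_k}$, for some $a_j \ge 1$.
Then

$$\frac{1}{n}\sum_{i=1}^{n} \big(X_{n:i}^{(1)}\big)^{a_1} \cdots \big(X_{n:i}^{(k)}\big)^{a_k} \xrightarrow{\text{a.e.}} \frac{1}{a_1 + \cdots + a_k + 1},$$

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \Big(\big(X_{n:i}^{(1)}\big)^{a_1} \cdots \big(X_{n:i}^{(k)}\big)^{a_k} - \frac{1}{a_1 + \cdots + a_k + 1}\Big) \xrightarrow{\text{dist}} N(0, \sigma^2),$$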
where

$$\sigma^2 = \frac{2}{M^2}\sum_{1 \le j < r \le k} a_j a_r\, E(X_{1j}X_{1r})^{M} + \frac{1}{(2M + 1)M^2}\sum_{j=1}^{k} a_j^2 - \frac{1}{(M + 1)^2},$$


and $M = a_1 + \cdots + a_k$.
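Note that the limit in this example does not depend on the joint distribution
of $X_{1j}$. In particular, if $a_1 = a_2 = 1$, we obtain that both $\frac{1}{n}\sum_{i=1}^{n} X_{n:i}^{(1)} X_{n:i}^{(2)}$
and $\frac{1}{n}\sum_{i=1}^{n} \big(X_{n:i}^{(1)}\big)^2$ converge to the same limit $E(X_{11}^2) = \frac{1}{3}$ a.e.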
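
A small simulation makes the last remark concrete. The sketch below is illustrative only: the function name, the sample size, and the two dependence structures (independent coordinates, and the extreme case $X_{i2} = 1 - X_{i1}$) are assumptions chosen for the example. In both cases the marginals are uniform, so the mean of products of order statistics should be close to $1/3$.

```python
import random


def mean_of_order_stat_products(n, dependent, seed=0):
    """(1/n) * sum_i X^(1)_{n:i} * X^(2)_{n:i} for uniform marginals."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    if dependent:
        # Second coordinate is a deterministic function of the first,
        # but its marginal distribution is still uniform on (0, 1).
        y = [1.0 - xi for xi in x]
    else:
        y = [rng.random() for _ in range(n)]
    xs, ys = sorted(x), sorted(y)   # order statistics of each coordinate
    return sum(a * b for a, b in zip(xs, ys)) / n


independent_value = mean_of_order_stat_products(20000, dependent=False)
dependent_value = mean_of_order_stat_products(20000, dependent=True)
```

Both values land near $1/3$: sorting each coordinate separately discards the original pairing, so only the marginals matter for the limit.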


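Example 2.2. (Regression with lost associations.) Let $\{(X_i, Y_i), 1 \le i \le
n\}$ be i.i.d. bivariate normal random vectors with correlation $\rho$, means
$\mu_1, \mu_2$, and standard deviations $\sigma_1, \sigma_2$. Let the marginal distributions of
$X_1$ and $Y_1$ be denoted by $F$ and $G$. Clearly,


$$G^{-1}(F(x)) = \mu_2 + \frac{\sigma_2}{\sigma_1}(x - \mu_1).$$


Then

$$\frac{1}{n}\sum_{i=1}^{n} X_{n:i} Y_{n:i} \xrightarrow{\text{a.e.}} \int_0^1 F^{-1}(u) G^{-1}(u)\,du = E\big(X_1 G^{-1}(F(X_1))\big) = \mu_1\mu_2 + \sigma_1\sigma_2,$$

and

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \big(X_{n:i} Y_{n:i} - \mu_1\mu_2 - \sigma_1\sigma_2\big) \xrightarrow{\text{dist}} N(0, \sigma^2),$$

where


$$\sigma^2 = \mu_1^2\sigma_2^2 + \mu_2^2\sigma_1^2 + (1 + \rho^2)\sigma_1^2\sigma_2^2 + 2\rho\mu_1\mu_2\sigma_1\sigma_2.$$

Regression under broken samples is considered in Section 2.3, where it is
indicated that the regression coefficients depend only on


$$\frac{1}{n}\sum_{i=1}^{n} X_{n:i} Y_{n:i} \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^{n} X_{n:i} Y_{n:(n-i+1)}.$$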


This article provides an exposition of recent developments on the analysis
of landmark based shapes in which a k-ad, i.e., a set of k points or
landmarks on an object or a scene, is observed in 2D or 3D, for purposes
of identification, discrimination, or diagnostics. Depending on the way
the data are collected or recorded, the appropriate shape of an object is
the maximal invariant specified by the space of orbits under a group G
of transformations. All these spaces are manifolds, often with natural
Riemannian structures. The statistical analysis based on Riemannian
structures is said to be intrinsic. In other cases, proper distances are
sought via an equivariant embedding of the manifold M in a vector space
E, and the corresponding statistical analysis is called extrinsic.


3.1. Introduction 


Statistical analysis of a probability measure Q on a differentiable manifold 
M has diverse applications in directional and axial statistics, morphomet- 
rics, medical diagnostics and machine vision. In this article, we are mostly 
concerned with the analysis of landmark based data, in which each
observation consists of k > m points in m-dimension, representing k locations on
an object, called a k-ad. The choice of landmarks is generally made with
expert help in the particular field of application. The objects of study can
be anything for which two k-ads are equivalent modulo a group of
transformations appropriate for the particular problem depending on the method
of recording of the observations. For example, one may look at k-ads mod- 
ulo size and Euclidean rigid body motions of translation and rotation. The 
analysis of shapes under this invariance was pioneered by [27, 28] and [13]. 
Bookstein’s approach is primarily registration-based requiring two or three 
landmarks to be brought into a standard position by translation, rotation 
and scaling of the k-ad. For these shapes, we would prefer Kendall’s more 
invariant view of a shape identified with the orbit under rotation (in m- 
dimension) of the k-ad centered at the origin and scaled to have unit size. 
The resulting shape space is denoted $\Sigma_m^k$. A fairly comprehensive account
of parametric inference on these manifolds, with many references to the 
literature, may be found in [21]. The nonparametric methodology pursued 
here, along with the geometric and other mathematical issues that accom- 
pany it, stems from the earlier work of [9-11]. 

Recently there has been much emphasis on the statistical analysis of 
other notions of shapes of k-ads, namely, affine shapes invariant under affine 
transformations, and projective shapes invariant under projective transfor- 
mations. Reconstruction of a scene from two (or more) aerial photographs 
taken from a plane is one of the research problems in affine shape analysis. 
Potential applications of projective shape analysis include face recognition 
and robotics, for robots to visually recognize a scene ([36], [1]).

Examples of analysis with real data suggest that appropriate nonpara- 
metric methods are more powerful than their parametric counterparts in 
the literature, for distributions that occur in applications ([7]). 

There is a large literature on registration via landmarks in functional 
data analysis (see, e.g., [12], [43], [37]), in which proper alignments of curves 
are necessary for purposes of statistical analysis. However this subject is 
not closely related to the topics considered in the present article. 

The article is organized as follows. Section 3.2 provides a brief
expository description of the geometries of the manifolds that arise in shape
analysis. Section 3.3 introduces the basic notion of the Fréchet mean as
the unique minimizer of the Fréchet function $F(p)$, which is used here to
nonparametrically discriminate different distributions. Section 3.4 outlines
the asymptotic theory for the extrinsic mean, namely, the unique minimizer of
the Fréchet function $F(p) = \int_M \rho^2(p, x)\,Q(dx)$ where $\rho$ is the distance
inherited by the manifold M from an equivariant embedding J. In Section 3.5,
we describe the corresponding asymptotic theory for intrinsic means on
Riemannian manifolds, where $\rho$ is the geodesic distance. In Section 3.6,
we apply the theory of extrinsic and intrinsic analysis to some manifolds
including the shape spaces of interest. Finally, Section 3.7 illustrates the
theory with three applications to real data.


3.2. Geometry of Shape Manifolds 


Many differentiable manifolds $M$ naturally occur as submanifolds, or
surfaces or hypersurfaces, of a Euclidean space. One example of this is the
sphere $S^d = \{p \in \mathbb{R}^{d+1} : \|p\| = 1\}$. The shape spaces of interest here are
not of this type. They are generally quotients of a Riemannian manifold $N$
under the action of a transformation group. A number of them are quotient
spaces of $N = S^d$ under the action of a compact group $G$, i.e., the elements
of the space are orbits in $S^d$ traced out by the application of $G$. Among
important examples of this kind are axial spaces and Kendall’s shape spaces.
In some cases the action of the group is free, i.e., $gp = p$ only holds for the
identity element $g = e$. Then the elements of the orbit $O_p = \{gp : g \in G\}$
are in one-one correspondence with elements of $G$, and one can identify the
orbit with the group. The orbit inherits the differential structure of the
Lie group $G$. The tangent space $T_pN$ at a point $p$ may then be
decomposed into a vertical subspace of dimension that of the group $G$ along the
orbit space to which $p$ belongs, and a horizontal one which is orthogonal
to it. The projection $\pi$, $\pi(p) = O_p$, is a Riemannian submersion of $N$ onto
the quotient space $N/G$. In other words, $\langle d\pi(v), d\pi(w)\rangle_{\pi(p)} = \langle v, w\rangle_p$ for
horizontal vectors $v, w \in T_pN$, where $d\pi : T_pN \to T_{\pi(p)}N/G$ denotes the
differential, or Jacobian, of the projection $\pi$. With this metric tensor, $N/G$
has the natural structure of a Riemannian manifold. The intrinsic
analysis proposed for these spaces is based on this Riemannian structure (see
Section 3.5).

Often it is simpler both mathematically and computationally to carry
out an extrinsic analysis, by embedding $M$ in some Euclidean space $E^k \approx
\mathbb{R}^k$, with the distance induced from that of $E^k$. This is also pursued when
an appropriate Riemannian structure on $M$ is not in sight. Among the
possible embeddings, one seeks out equivariant embeddings which preserve
many of the geometric features of $M$.


Definition 3.1. For a Lie group $H$ acting on a manifold $M$, an embedding
$J : M \to \mathbb{R}^k$ is $H$-equivariant if there exists a group homomorphism
$\phi : H \to GL(k, \mathbb{R})$ such that


$$J(hp) = \phi(h)J(p) \quad \forall p \in M,\ \forall h \in H. \qquad (2.1)$$

Here $GL(k, \mathbb{R})$ is the general linear group of all $k \times k$ non-singular matrices.


3.2.1. The Real Projective Space $\mathbb{R}P^d$


This is the axial space comprising axes or lines through the origin in $\mathbb{R}^{d+1}$.
Thus elements of $\mathbb{R}P^d$ may be represented as equivalence classes


$$[x] = [x^1 : x^2 : \cdots : x^{d+1}] = \{\lambda x : \lambda \ne 0\}, \quad x \in \mathbb{R}^{d+1} \setminus \{0\}. \qquad (2.2)$$


One may also identify $\mathbb{R}P^d$ with $S^d/G$, with $G$ comprising the identity map
and the antipodal map $p \mapsto -p$. Its structure as a $d$-dimensional manifold
(with quotient topology) and its Riemannian structure both derive from this
identification. Among applications are observations on galaxies, on axes of
crystals, or on the line of a geological fissure ([42], [35], [22], [3], [29]).


3.2.2. Kendall’s (Direct Similarity) Shape Spaces $\Sigma_m^k$


Kendall’s shape spaces are quotient spaces $S^d/G$, under the action of the
special orthogonal group $G = SO(m)$ of $m \times m$ orthogonal matrices with
determinant +1. For the important case $m = 2$, consider the space of
all planar k-ads $(z_1, z_2, \ldots, z_k)$ ($z_j = (x_j, y_j)$), $k > 2$, excluding those
with $k$ identical points. The set of all centered and normed k-ads, say
$u = (u_1, u_2, \ldots, u_k)$, comprises a unit sphere in a $(2k - 2)$-dimensional
vector space and is, therefore, a $(2k - 3)$-dimensional sphere $S^{2k-3}$, called the
preshape sphere. The group $G = SO(2)$ acts on the sphere by rotating
each landmark by the same angle. The orbit under $G$ of a point $u$ in the
preshape sphere can thus be seen to be a circle $S^1$, so that Kendall’s planar
shape space $\Sigma_2^k$ can be viewed as the quotient space $S^{2k-3}/G \sim S^{2k-3}/S^1$,
a $(2k - 4)$-dimensional compact manifold. An algebraically simpler
representation of $\Sigma_2^k$ is given by the complex projective space $\mathbb{C}P^{k-2}$, described
in Section 3.6.4. For many applications in archaeology, astronomy,
morphometrics, medical diagnosis, etc., see [14, 15], [29], [21], [10, 11], [7] and [39].
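
For planar k-ads the construction of the preshape is easy to carry out with complex numbers, writing each landmark $(x_j, y_j)$ as $x_j + i y_j$. The sketch below is an illustration only; the function names `preshape` and `orbit_alignment` are hypothetical. It relies on the fact that a rotation acts on a complex preshape as multiplication by a unit complex number, so two unit-norm preshapes $u, v$ lie on the same $SO(2)$ orbit exactly when the modulus of their Hermitian inner product equals 1.

```python
import cmath


def preshape(kad):
    """Center a planar k-ad (list of complex landmarks) and scale to unit norm."""
    k = len(kad)
    centroid = sum(kad) / k
    centered = [z - centroid for z in kad]
    size = sum(abs(z) ** 2 for z in centered) ** 0.5
    if size == 0:
        raise ValueError("all k landmarks coincide; the shape is undefined")
    return [z / size for z in centered]


def orbit_alignment(u, v):
    """|<u, v>| for unit preshapes; equals 1 iff v is a rotation of u."""
    return abs(sum(a * b.conjugate() for a, b in zip(u, v)))


triangle = [0 + 0j, 1 + 0j, 0.3 + 0.8j]
# Same triangle after rotation by 0.7 rad, scaling by 2.5 and a translation.
moved = [cmath.exp(0.7j) * 2.5 * z + (3 - 2j) for z in triangle]
# A genuinely different triangle.
other = [0 + 0j, 1 + 0j, 0.5 + 2j]

same_shape_score = orbit_alignment(preshape(triangle), preshape(moved))
different_shape_score = orbit_alignment(preshape(triangle), preshape(other))
```

Here `same_shape_score` is 1 up to rounding, while `different_shape_score` is strictly smaller, reflecting that translation, scale and rotation have been quotiented out.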


3.2.3. Reflection (Similarity) Shape Spaces $R\Sigma_m^k$


Consider now the reflection shape of a k-ad as defined in Section 3.2.2, but
with $SO(m)$ replaced by the larger orthogonal group $O(m)$ of all $m \times m$
orthogonal matrices (with determinants either +1 or −1). The reflection
shape space $R\Sigma_m^k$ is the space of orbits of the elements $u$ of the preshape
sphere whose columns span $\mathbb{R}^m$.


3.2.4. Affine Shape Spaces $A\Sigma_m^k$


The affine shape of a k-ad in $\mathbb{R}^m$ may be defined as the orbit of this k-ad
under the group of all affine transformations $x \mapsto F(x) = Ax + b$, where
$A$ is an arbitrary $m \times m$ non-singular matrix and $b$ is an arbitrary point
in $\mathbb{R}^m$. Note that two k-ads $x = (x_1, \ldots, x_k)$ and $y = (y_1, \ldots, y_k)$ ($x_j, y_j
\in \mathbb{R}^m$ for all $j$) have the same affine shape if and only if the centered k-ads
$u = (u_1, u_2, \ldots, u_k) = (x_1 - \bar{x}, \ldots, x_k - \bar{x})$ and $v = (v_1, v_2, \ldots, v_k) = (y_1 -
\bar{y}, \ldots, y_k - \bar{y})$ are related by a transformation $Au = (Au_1, \ldots, Au_k) = v$.
The centered k-ads lie in a linear subspace of $(\mathbb{R}^m)^k$ of dimension $m(k -
1)$. Assume $k > m + 1$. The affine shape space is then defined as the
quotient space $H(m, k)/GL(m, \mathbb{R})$, where $H(m, k)$ consists of all centered
k-ads whose landmarks span $\mathbb{R}^m$, and $GL(m, \mathbb{R})$ is the general linear group
on $\mathbb{R}^m$ (of all $m \times m$ nonsingular matrices) which has the relative topology
(and distance) of $\mathbb{R}^{m^2}$ and is a manifold of dimension $m^2$. It follows that
$A\Sigma_m^k$ is a manifold of dimension $m(k - 1) - m^2$. For $u, v \in H(m, k)$,
$Au = v$ iff $u'A' = v'$, and as $A$ varies, $u'A'$ generates the linear subspace
$L$ of $H(m, k)$ spanned by the $m$ rows of $u$. The affine shape of $u$ (or
of $x$) is identified with this subspace. Thus $A\Sigma_m^k$ may be identified with
the set of all $m$-dimensional subspaces of $\mathbb{R}^{k-1}$, namely, the Grassmannian
$G_m(k - 1)$, a result of [40] (also see [16], pp. 63-64, 362-363). Affine shape
spaces arise in certain problems of bioinformatics, cartography, machine
vision and pattern recognition ([4, 5], [38], [40]).


3.2.5. Projective Shape Spaces $P\Sigma_m^k$


For purposes of machine vision, if images are taken from a great distance,
such as a scene on the ground photographed from an airplane, affine shape
analysis is appropriate. Otherwise, projective shape is a more appropriate
choice. If one thinks of images or photographs obtained through a central
projection (a pinhole camera is an example of this), a ray is received as a
point on the image plane (e.g., the film of the camera). Since axes in 3D
comprise the projective space $\mathbb{R}P^2$, k-ads in this view are valued in $\mathbb{R}P^2$.
Note that for a 3D k-ad to represent a k-ad in $\mathbb{R}P^2$, the corresponding
axes must all be distinct. To have invariance with regard to camera angles,
one may first look at the original noncollinear (centered) 3D k-ad $u$ and
achieve affine invariance by its affine shape (i.e., by the equivalence class $Au$,
$A \in GL(3, \mathbb{R})$), and finally take the corresponding equivalence class of axes
in $\mathbb{R}P^2$ to define the projective shape of the k-ad as the equivalence class,
or orbit, with respect to projective transformations on $\mathbb{R}P^2$. A projective
shape (of a k-ad) is singular if the $k$ axes lie on a vector plane ($\mathbb{R}P^1$). For
$k > 4$, the space of all non-singular shapes is the 2D projective shape space,
denoted $P_0\Sigma_2^k$.

In general, a projective (general linear) transformation $\alpha$ on $\mathbb{R}P^m$ is
defined in terms of an $(m+1) \times (m+1)$ nonsingular matrix $A \in GL(m+1, \mathbb{R})$
by


$$\alpha([x]) = \alpha([x^1 : \cdots : x^{m+1}]) = [A(x^1, \ldots, x^{m+1})'], \qquad (2.3)$$


where $x = (x^1, \ldots, x^{m+1}) \in \mathbb{R}^{m+1} \setminus \{0\}$. The group of all projective
transformations on $\mathbb{R}P^m$ is denoted by $PGL(m)$. Now consider a k-ad
$(y_1, \ldots, y_k)$ in $\mathbb{R}P^m$, say $y_j = [x_j]$ ($j = 1, \ldots, k$), $k > m + 2$. The
projective shape of this k-ad is its orbit under $PGL(m)$, i.e., $\{(\alpha y_1, \ldots, \alpha y_k) : \alpha \in
PGL(m)\}$. To exclude singular shapes, define a k-ad $(y_1, \ldots, y_k) =
([x_1], \ldots, [x_k])$ to be in general position if the linear span of $\{y_1, \ldots, y_k\}$ is
$\mathbb{R}P^m$, i.e., if the linear span of the set of $k$ representative points $\{x_1, \ldots, x_k\}$
in $\mathbb{R}^{m+1}$ is $\mathbb{R}^{m+1}$. The space of shapes of all k-ads in general position is the
projective shape space $P_0\Sigma_m^k$. Define a projective frame in $\mathbb{R}P^m$ to be an
ordered system of $m + 2$ points in general position. Let $I = i_1 < \cdots < i_{m+2}$
be an ordered subset of $\{1, \ldots, k\}$. A manifold structure on $P_I\Sigma_m^k$, the
open dense subset of $P_0\Sigma_m^k$ of k-ads for which $(y_{i_1}, \ldots, y_{i_{m+2}})$ is a
projective frame in $\mathbb{R}P^m$, was derived in [36] as follows. The standard frame is
defined to be $([e_1], \ldots, [e_{m+1}], [e_1 + e_2 + \cdots + e_{m+1}])$, where $e_j \in \mathbb{R}^{m+1}$ has
1 in the $j$-th coordinate and zeros elsewhere. Given two projective frames
$(p_1, \ldots, p_{m+2})$ and $(q_1, \ldots, q_{m+2})$, there exists a unique $\alpha \in PGL(m)$ such
that $\alpha(p_j) = q_j$ ($j = 1, \ldots, m + 2$). By ordering the points in a k-ad such that
the first $m + 2$ points are in general position, one may bring this ordered
set, say, $(p_1, \ldots, p_{m+2})$, to the standard form by a unique $\alpha \in PGL(m)$.
Then the ordered set of remaining $k - m - 2$ points is transformed to a
point in $(\mathbb{R}P^m)^{k-m-2}$. This provides a diffeomorphism between $P_I\Sigma_m^k$ and
the product of $k - m - 2$ copies of the real projective space $\mathbb{R}P^m$.

We will return to these manifolds again in Section 3.6. Now we turn to 
nonparametric inference on general manifolds. 


3.3. Fréchet Means on Metric Spaces 


Let $(M, \rho)$ be a metric space, $\rho$ being the distance, and let $f \ge 0$ be a given
continuous increasing function on $[0, \infty)$. For a given probability measure
$Q$ on (the Borel sigma field of) $M$, define the Fréchet function of $Q$ as


$$F(p) = \int_M f(\rho(p, x))\,Q(dx), \quad p \in M. \qquad (3.1)$$


Definition 3.2. Suppose $F(p) < \infty$ for some $p \in M$. Then the set of all $p$
for which $F(p)$ is the minimum value of $F$ on $M$ is called the Fréchet mean
set of $Q$, denoted by $C_Q$. If this set is a singleton, say $\{\mu_F\}$, then $\mu_F$ is
called the Fréchet mean of $Q$. If $X_1, X_2, \ldots, X_n$ are independent and
identically distributed (iid) $M$-valued random variables defined on some
probability space $(\Omega, \mathcal{F}, P)$ with common distribution $Q$, and $Q_n = \frac{1}{n}\sum_{j=1}^{n} \delta_{X_j}$
is the corresponding empirical distribution, then the Fréchet mean set of
$Q_n$ is called the sample Fréchet mean set, denoted by $C_{Q_n}$. If this set is a
singleton, say $\{\mu_{F_n}\}$, then $\mu_{F_n}$ is called the sample Fréchet mean.
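
To make the definition concrete, the sketch below computes a sample Fréchet mean set by brute force over a finite set of candidate points. It is an illustration under stated assumptions: the function names are hypothetical, the space is the unit circle with arc-length distance, $f(r) = r^2$, and the minimization runs over a grid of 360 angles, so the result only approximates the true mean set.

```python
import math


def frechet_function(p, sample, dist, f=lambda r: r * r):
    """Empirical Frechet function F_n(p) = (1/n) * sum_j f(dist(p, X_j))."""
    return sum(f(dist(p, x)) for x in sample) / len(sample)


def frechet_mean_set(candidates, sample, dist, f=lambda r: r * r, tol=1e-9):
    """All candidates whose Frechet function is within tol of the minimum."""
    values = [frechet_function(p, sample, dist, f) for p in candidates]
    best = min(values)
    return [p for p, v in zip(candidates, values) if v <= best + tol]


def arc_distance(a, b):
    """Arc-length distance between two angles on the unit circle."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


grid = [2 * math.pi * i / 360 for i in range(360)]   # candidate angles

clustered = [0.4, 0.5, 0.6]        # concentrated sample: a single minimizer
antipodal = [0.0, math.pi]         # two opposite points: two minimizers

unique_mean = frechet_mean_set(grid, clustered, arc_distance)
mean_set = frechet_mean_set(grid, antipodal, arc_distance)
```

The second sample shows why the definition speaks of a mean set: for two antipodal points the minimum is attained at both angles midway between them, so no single Fréchet mean exists.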


Proposition 3.1 proves the consistency of the sample Fréchet mean as 
an estimator of the Fréchet mean of Q. 


Proposition 3.1. Let $M$ be a compact metric space. Consider the Fréchet
function $F$ of a probability measure given by (3.1). Given any $\epsilon > 0$, there
exists an integer-valued random variable $N = N(\omega, \epsilon)$ and a $P$-null set
$A(\omega, \epsilon)$ such that


$$C_{Q_n} \subset C_Q^{\epsilon} \equiv \{p \in M : \rho(p, C_Q) < \epsilon\}, \quad \forall n \ge N \qquad (3.2)$$


outside of $A(\omega, \epsilon)$. In particular, if $C_Q = \{\mu_F\}$, then every measurable
selection, $\mu_{F_n}$, from $C_{Q_n}$ is a strongly consistent estimator of $\mu_F$.


Proof. For simplicity of notation, we write $C = C_Q$, $C_n = C_{Q_n}$, $\mu = \mu_F$
and $\mu_n = \mu_{F_n}$. Choose $\epsilon > 0$ arbitrarily. If $C^{\epsilon} = M$, then (3.2) holds with
$N = 1$. If $D = M \setminus C^{\epsilon}$ is nonempty, write


$$l = \min\{F(p) : p \in M\} = F(q) \quad \forall q \in C,$$
$$l + \delta(\epsilon) = \min\{F(p) : p \in D\}, \quad \delta(\epsilon) > 0. \qquad (3.3)$$


It is enough to show that


$$\max\{|F_n(p) - F(p)| : p \in M\} \to 0 \ \text{a.s., as } n \to \infty. \qquad (3.4)$$

For if (3.4) holds, then there exists $N \ge 1$ such that, outside a $P$-null set
$A(\omega, \epsilon)$,


$$\min\{F_n(p) : p \in C\} \le l + \frac{\delta(\epsilon)}{3},$$
$$\min\{F_n(p) : p \in D\} \ge l + \frac{\delta(\epsilon)}{2}, \quad \forall n \ge N. \qquad (3.5)$$


Clearly (3.5) implies (3.2).
To prove (3.4), choose and fix $\epsilon' > 0$, however small. Note that $\forall p, p',
x \in M$,


$$|\rho(p, x) - \rho(p', x)| \le \rho(p, p').$$


Hence


$$|F(p) - F(p')| \le \max\{|f(\rho(p, x)) - f(\rho(p', x))| : x \in M\}
 \le \max\{|f(u) - f(u')| : |u - u'| \le \rho(p, p')\},$$
$$|F_n(p) - F_n(p')| \le \max\{|f(u) - f(u')| : |u - u'| \le \rho(p, p')\}. \qquad (3.6)$$


Since $f$ is uniformly continuous on $[0, R]$ where $R$ is the diameter of $M$, so
are $F$ and $F_n$ on $M$, and there exists $\delta(\epsilon') > 0$ such that


$$|F(p) - F(p')| \le \frac{\epsilon'}{4}, \quad |F_n(p) - F_n(p')| \le \frac{\epsilon'}{4} \qquad (3.7)$$

if $\rho(p, p') < \delta(\epsilon')$. Let $\{q_1, \ldots, q_k\}$ be a $\delta(\epsilon')$-net of $M$, i.e., $\forall p \in M$ there
exists $q(p) \in \{q_1, \ldots, q_k\}$ such that $\rho(p, q(p)) < \delta(\epsilon')$. By the strong law
of large numbers, there exists an integer-valued random variable $N(\omega, \epsilon')$
such that outside of a $P$-null set $A(\omega, \epsilon')$, one has


$$|F_n(q_i) - F(q_i)| \le \frac{\epsilon'}{4} \quad \forall i = 1, 2, \ldots, k; \ \text{if } n \ge N(\omega, \epsilon'). \qquad (3.8)$$


From (3.7) and (3.8) we get


$$|F(p) - F_n(p)| \le |F(p) - F(q(p))| + |F(q(p)) - F_n(q(p))| + |F_n(q(p)) - F_n(p)|
 \le \frac{3\epsilon'}{4} < \epsilon', \quad \forall p \in M,$$


if $n \ge N(\omega, \epsilon')$ outside of $A(\omega, \epsilon')$. This proves (3.4). □


Remark 3.1. Under an additional assumption guaranteeing the existence
of a minimizer of $F$, Proposition 3.1 can be extended to all metric spaces
whose closed and bounded subsets are all compact. We will consider such an
extension elsewhere, thereby generalizing Theorem 2.3 in [10]. For
statistical analysis on shape spaces which are compact manifolds, Proposition 3.1
suffices.


Remark 3.2. One can show that the reverse of (3.2), that is, “$C_Q \subset C_{Q_n}^{\epsilon}$ $\forall
n \ge N(\omega, \epsilon)$”, does not hold in general. See for example Remark 2.6 in [10].


Remark 3.3. In view of Proposition 3.1, if the Fréchet mean $\mu_F$ of $Q$
exists as a unique minimizer of $F$, then every measurable selection of a
sequence $\mu_{F_n} \in C_{Q_n}$ ($n \ge 1$) converges to $\mu_F$ with probability one. In the
rest of the paper it therefore suffices to define the sample Fréchet mean as
a measurable selection from $C_{Q_n}$ ($n \ge 1$).


Next we consider the asymptotic distribution of $\mu_{F_n}$. For Theorem 3.1,
we assume $M$ to be a differentiable manifold of dimension $d$. Let $\rho$ be a
distance metrizing the topology of $M$. The proof of the theorem is similar
to that of Theorem 2.1 in [11]. Denote by $D_r$ the partial derivative w.r.t.
the $r$-th coordinate ($r = 1, \ldots, d$).


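Theorem 3.1. Suppose the following assumptions hold:
A1 $Q$ has support in a single coordinate patch, $(U, \phi)$. [$\phi : U \to \mathbb{R}^d$
smooth.] Let $Y_j = \phi(X_j)$, $j = 1, \ldots, n$.
A2 The Fréchet mean $\mu_F$ of $Q$ is unique.
A3 $\forall x$, $y \mapsto h(x, y) = (\rho^{\phi})^2(x, y) = \rho^2(\phi^{-1}x, \phi^{-1}y)$ is twice continuously
differentiable in a neighborhood of $\phi(\mu_F) = \mu$.
A4 $E\{D_r h(Y, \mu)\}^2 < \infty$ $\forall r$.
A5 $E\{\sup_{|u - v| \le \epsilon} |D_s D_r h(Y, v) - D_s D_r h(Y, u)|\} \to 0$ as $\epsilon \to 0$ $\forall r, s$.
A6 $\Lambda = ((E\{D_s D_r h(Y, \mu)\}))$ is nonsingular.
A7 $\Sigma = \mathrm{Cov}[\mathrm{grad}\, h(Y_1, \mu)]$ is nonsingular.
Let $\mu_{F,n}$ be a measurable selection from the sample Fréchet mean set, and
write $\mu_n = \phi(\mu_{F,n})$. Then under the assumptions A1-A7,


$$\sqrt{n}(\mu_n - \mu) \xrightarrow{\mathcal{L}} N\big(0, \Lambda^{-1}\Sigma(\Lambda')^{-1}\big). \qquad (3.9)$$


3.4. Extrinsic Means on Manifolds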


From now on, we assume that $M$ is a Riemannian manifold of dimension $d$.
Let $H$ be a Lie group acting on $M$ and let $J : M \to E^N$ be an $H$-equivariant
embedding of $M$ into some Euclidean space $E^N$ of dimension $N$. For all our
applications, $H$ is compact. Then $J$ induces the metric


$$\rho(x, y) = \|J(x) - J(y)\| \qquad (4.1)$$


on $M$, where $\|\cdot\|$ denotes the Euclidean norm ($\|u\|^2 = \sum_{i=1}^{N} u_i^2$ $\forall u =
(u_1, u_2, \ldots, u_N)$). This is called the extrinsic distance on $M$.

For the Fréchet function $F$ in (3.1), let $f(r) = r^2$ on $[0, \infty)$. This choice
of the Fréchet function makes the Fréchet mean computable in a number of
important examples using Proposition 3.2. Assume $J(M) = \tilde{M}$ is a closed
subset of $E^N$. Then for every $u \in E^N$ there exists a compact set of points
in $\tilde{M}$ whose distance from $u$ is the smallest among all points in $\tilde{M}$. We
denote this set by


$$P_{\tilde{M}}u = \{x \in \tilde{M} : \|x - u\| \le \|y - u\| \ \forall y \in \tilde{M}\}. \qquad (4.2)$$


If this set is a singleton, $u$ is said to be a nonfocal point of $E^N$ (w.r.t. $\tilde{M}$),
otherwise it is said to be a focal point of $E^N$.


Definition 3.3. Let $(M, \rho)$, $J$ be as above. Let $Q$ be a probability measure
on $M$ such that the Fréchet function


$$F(x) = \int_M \rho^2(x, y)\,Q(dy) \qquad (4.3)$$


is finite. The Fréchet mean (set) of $Q$ is called the extrinsic mean (set) of
$Q$. If $X_i$, $i = 1, \ldots, n$ are iid observations from $Q$ and $Q_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$,
then the Fréchet mean (set) of $Q_n$ is called the extrinsic sample mean (set).


Let $\tilde{Q}$ and $\tilde{Q}_n$ be the images of $Q$ and $Q_n$ respectively in $E^N$: $\tilde{Q} =
Q \circ J^{-1}$, $\tilde{Q}_n = Q_n \circ J^{-1}$.


Proposition 3.2. (a) If $\tilde{\mu} = \int_{E^N} u\,\tilde{Q}(du)$ is the mean of $\tilde{Q}$, then the
extrinsic mean set of $Q$ is given by $J^{-1}(P_{\tilde{M}}\tilde{\mu})$. (b) If $\tilde{\mu}$ is a nonfocal point of
$E^N$ then the extrinsic mean of $Q$ exists (as a unique minimizer of $F$).


Proof. See Proposition 3.1, [10]. □


Corollary 3.1. If $\tilde{\mu} = \int_{E^N} u\,\tilde{Q}(du)$ is a nonfocal point of $E^N$ then the
extrinsic sample mean $\mu_n$ (any measurable selection from the extrinsic sample
mean set) is a strongly consistent estimator of the extrinsic mean $\mu$ of $Q$.


Proof. Follows from Proposition 3.1 for compact $M$. For the more
general case, see [10]. □
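
For a concrete instance of Proposition 3.2, take $M$ to be the unit sphere with $J$ the inclusion map into Euclidean space. The projection is then $P(u) = u/\|u\|$ and the origin is the only focal point. The sketch below computes the extrinsic sample mean in this special case; the function name and the tolerance used to detect a focal Euclidean mean are assumptions of the example.

```python
import math


def extrinsic_mean_sphere(points, tol=1e-12):
    """Extrinsic sample mean on the unit sphere under the inclusion embedding:
    average the points in Euclidean space, then project back onto the sphere."""
    n = len(points)
    dim = len(points[0])
    euclid_mean = [sum(p[i] for p in points) / n for i in range(dim)]
    norm = math.sqrt(sum(c * c for c in euclid_mean))
    if norm < tol:
        # The Euclidean mean is the origin: every point of the sphere is
        # equally close to it, so the extrinsic mean is not unique.
        raise ValueError("Euclidean mean is (numerically) a focal point")
    return [c / norm for c in euclid_mean]


sample = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mean_direction = extrinsic_mean_sphere(sample)   # approximately (0.707, 0.707, 0)
```

An antipodal pair such as `(0, 0, 1)` and `(0, 0, -1)` has Euclidean mean zero, which is the focal case excluded in part (b) of the proposition.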


3.4.1. Asymptotic Distribution of the Extrinsic Sample Mean


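Although one can apply Theorem 3.1 here, we prefer a different, and more
widely applicable approach, which does not require that the support of $Q$
be contained in a coordinate patch. Let $\bar{Y} = \frac{1}{n}\sum_{j=1}^{n} Y_j$ be the (sample)
mean of $Y_j = J(X_j)$. In a neighborhood of a nonfocal point such as $\tilde{\mu}$, $P(\cdot)$
is smooth. Hence it can be shown that


$$\sqrt{n}[P(\bar{Y}) - P(\tilde{\mu})] = \sqrt{n}(d_{\tilde{\mu}}P)(\bar{Y} - \tilde{\mu}) + o_P(1) \qquad (4.4)$$


where $d_{\tilde{\mu}}P$ is the differential (map) of the projection $P(\cdot)$, which takes
vectors in the tangent space of $E^N$ at $\tilde{\mu}$ to tangent vectors of $\tilde{M}$ at $P(\tilde{\mu})$.
Let $f_1, f_2, \ldots, f_d$ be an orthonormal basis of $T_{P(\tilde{\mu})}J(M)$ and $e_1, e_2, \ldots, e_N$
be an orthonormal basis (frame) for $TE^N \approx E^N$. One has


$$\sqrt{n}(\bar{Y} - \tilde{\mu}) = \sum_{j=1}^{N} \langle\sqrt{n}(\bar{Y} - \tilde{\mu}), e_j\rangle e_j,$$

$$d_{\tilde{\mu}}P\big(\sqrt{n}(\bar{Y} - \tilde{\mu})\big) = \sum_{j=1}^{N} \langle\sqrt{n}(\bar{Y} - \tilde{\mu}), e_j\rangle\, d_{\tilde{\mu}}P(e_j)$$
$$= \sum_{j=1}^{N} \langle\sqrt{n}(\bar{Y} - \tilde{\mu}), e_j\rangle \sum_{r=1}^{d} \langle d_{\tilde{\mu}}P(e_j), f_r\rangle f_r$$
$$= \sum_{r=1}^{d} \Big[\sum_{j=1}^{N} \langle d_{\tilde{\mu}}P(e_j), f_r\rangle \langle\sqrt{n}(\bar{Y} - \tilde{\mu}), e_j\rangle\Big] f_r. \qquad (4.5)$$


Hence $\sqrt{n}[P(\bar{Y}) - P(\tilde{\mu})]$ has an asymptotic Gaussian distribution on the
tangent space of $J(M)$ at $P(\tilde{\mu})$, with mean vector zero and a dispersion
matrix (w.r.t. the basis vectors $\{f_r : 1 \le r \le d\}$)


$$\Sigma = A'VA$$

where

$$A \equiv A(\tilde{\mu}) = \big(\big(\langle d_{\tilde{\mu}}P(e_j), f_r\rangle\big)\big)_{1 \le j \le N,\ 1 \le r \le d}$$


and $V$ is the $N \times N$ covariance matrix of $\tilde{Q} = Q \circ J^{-1}$ (w.r.t. the basis
$\{e_j : 1 \le j \le N\}$). In matrix notation,


$$\sqrt{n}\,\bar{T} \xrightarrow{\mathcal{L}} N(0, \Sigma) \ \text{ as } n \to \infty, \qquad (4.6)$$

where


$$T_j(\tilde{\mu}) = A'\big[\langle(Y_j - \tilde{\mu}), e_1\rangle \ \ldots \ \langle(Y_j - \tilde{\mu}), e_N\rangle\big]', \quad j = 1, \ldots, n$$

and


$$\bar{T} \equiv \bar{T}(\tilde{\mu}) = \frac{1}{n}\sum_{j=1}^{n} T_j(\tilde{\mu}).$$


This implies, writing $\mathcal{X}_d^2$ for the chi-square distribution with $d$ degrees of
freedom,


$$n\bar{T}'\Sigma^{-1}\bar{T} \xrightarrow{\mathcal{L}} \mathcal{X}_d^2, \ \text{ as } n \to \infty. \qquad (4.7)$$


A confidence region for $P(\tilde{\mu})$ with asymptotic confidence level $1 - \alpha$ is
then given by


$$\{P(\tilde{\mu}) : n\bar{T}'\hat{\Sigma}^{-1}\bar{T} \le \mathcal{X}_d^2(1 - \alpha)\} \qquad (4.8)$$


where $\hat{\Sigma} \equiv \hat{\Sigma}(\tilde{\mu})$ is the sample covariance matrix of $\{T_j(\tilde{\mu})\}_{j=1}^{n}$. The
corresponding bootstrapped confidence region is given by


$$\{P(\tilde{\mu}) : n\bar{T}'\hat{\Sigma}^{-1}\bar{T} \le c^*_{(1-\alpha)}\} \qquad (4.9)$$


where $c^*_{(1-\alpha)}$ is the upper $(1 - \alpha)$-quantile of the bootstrapped values $U^*$,
$U^* = n\bar{T}^{*\prime}\hat{\Sigma}^{*-1}\bar{T}^*$, with $\bar{T}^*$, $\hat{\Sigma}^*$ being the sample mean and covariance
respectively of the bootstrap sample $\{T_j(\bar{Y})\}_{j=1}^{n}$.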

3.5. Intrinsic Means on Manifolds 


Let $(M, g)$ be a complete connected Riemannian manifold with metric
tensor $g$. Then the natural choice for the distance metric $\rho$ in Section 3.3 is the
geodesic distance $d_g$ on $M$. Unless otherwise stated, we consider the
function $f(r) = r^2$ in (3.1) throughout this section and later sections. However
one may take more general $f$. For example one may consider $f(r) = r^a$, for
suitable $a \ge 1$.

Let $Q$ be a probability distribution on $M$ with finite Fréchet function


$$F(p) = \int_M d_g^2(p, m)\,Q(dm). \qquad (5.1)$$

Let $X_1, \ldots, X_n$ be an iid sample from $Q$.


Definition 3.4. The Fréchet mean set of $Q$ under $\rho = d_g$ is called the
intrinsic mean set of $Q$. The Fréchet mean set of the empirical distribution
$Q_n$ is called the sample intrinsic mean set.


Before proceeding further, let us define a few technical terms related to
Riemannian manifolds which we will use extensively in this section. For
details on Riemannian manifolds, see [19], [24] or [34].

1 Geodesic: These are curves $\gamma$ on the manifold with zero acceleration.
They are locally length minimizing curves. For example, consider great
circles on the sphere or straight lines in $\mathbb{R}^d$.

2 Exponential map: For $p \in M$, $v \in T_pM$, we define $\exp_p v = \gamma(1)$, where
$\gamma$ is a geodesic with $\gamma(0) = p$ and $\dot\gamma(0) = v$.

3 Cut locus: For a point $p \in M$, define the cut locus $C(p)$ of $p$ as the set
of points of the form $\gamma(t_0)$, where $\gamma$ is a unit speed geodesic starting at
$p$ and $t_0$ is the supremum of all $t > 0$ such that $\gamma$ is distance minimizing
from $p$ to $\gamma(t)$. For example, $C(p) = \{-p\}$ on the sphere.

4 Sectional Curvature: Recall the notion of Gaussian curvature of two
dimensional surfaces. On a Riemannian manifold $M$, choose a pair of
linearly independent vectors $u, v \in T_pM$. A two dimensional submanifold
of $M$ is swept out by the set of all geodesics starting at $p$ and with initial
velocities lying in the two-dimensional section $\pi$ spanned by $u, v$. The
Gaussian curvature of this submanifold is called the sectional curvature
at $p$ of the section $\pi$.

5 Injectivity Radius: Define the injectivity radius of $M$ as


$$\mathrm{inj}(M) = \inf\{d_g(p, C(p)) : p \in M\}.$$

For example the sphere of radius 1 has injectivity radius equal to $\pi$.


Also let $r_* = \min\{\mathrm{inj}(M), \pi/\sqrt{\bar{C}}\}$, where $\bar{C}$ is
the least upper bound of sectional curvatures of $M$ if this upper bound is
positive, and $\bar{C} = 0$ otherwise. The exponential map at $p$ is injective
on $\{v \in T_p(M) : |v| < r_*\}$. By $B(p, r)$ we will denote an open ball with
center $p \in M$ and radius $r$, and $\bar{B}(p, r)$ will denote its closure.

In case $Q$ has a unique intrinsic mean $\mu_I$, it follows from Proposition 3.1
and Remark 3.1 that the sample intrinsic mean $\mu_{nI}$ (a measurable selection
from the sample intrinsic mean set) is a consistent estimator of $\mu_I$. Broad
conditions for the existence of a unique intrinsic mean are not known. From
results due to [26] and [33], it follows that if the support of $Q$ is in a
geodesic ball of radius $r_*/4$, i.e. $\mathrm{supp}(Q) \subseteq B(p, r_*/4)$,
then $Q$ has a unique intrinsic mean. This result has been substantially
extended by [31] which shows that if $\mathrm{supp}(Q) \subseteq B(p, r_*/2)$,
then there is a unique local minimum of the Fréchet function $F$ in that ball.
Then we redefine the (local) intrinsic mean of $Q$ as that unique minimizer in
the ball. In that case one can show that the (local) sample intrinsic mean is a
consistent estimator of the intrinsic mean of $Q$. This is stated in
Proposition 3.3.
Proposition 3.3. Let $Q$ have support in $B(p, r_*/2)$ for some $p \in M$. Then
(a) $Q$ has a unique (local) intrinsic mean $\mu_I$ in $B(p, r_*/2)$ and (b) the
sample intrinsic mean $\mu_{nI}$ in $B(p, r_*/2)$ is a strongly consistent
estimator of $\mu_I$.

Proof. (a) Follows from [31].

(b) Since $\mathrm{supp}(Q)$ is compact, $\mathrm{supp}(Q) \subseteq B(p, r)$ for
some $r < r_*/2$. From Lemma 1, [33], it follows that $\mu_I \in B(p, r)$ and
$\mu_I$ is the unique intrinsic mean of $Q$ restricted to $\bar{B}(p, r)$. Now
take the compact metric space in Proposition 3.1 to be $\bar{B}(p, r)$ and the
result follows.


For the asymptotic distribution of the sample intrinsic mean, we may use 
Theorem 3.1. For that we need to verify assumptions A1-A7. Theorem 3.2
gives sufficient conditions for that. In the statement of the theorem, the 
usual partial order A > B between d x d symmetric matrices A, B, means 
that A — B is nonnegative definite. 


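Theorem 3.2. Assume $\mathrm{supp}(Q) \subseteq B(p, r_*/2)$. Let
$\phi = \exp_{\mu_I}^{-1} : B(p, r_*/2) \to T_{\mu_I}M\ (\approx \mathbb{R}^d)$.
Then the map $y \mapsto h(x, y) = d_g^2(\phi^{-1}x, \phi^{-1}y)$ is twice
continuously differentiable in a neighborhood of 0 and in terms of normal
coordinates with respect to a chosen orthonormal basis for $T_{\mu_I}M$,

    $D_r h(x, 0) = -2x^r$, $1 \le r \le d$,    (5.2)

    $[D_r D_s h(x, 0)] \ge \left[2\left\{\frac{1 - f(|x|)}{|x|^2}\, x^r x^s + f(|x|)\,\delta_{rs}\right\}\right]_{1 \le r,s \le d}$.    (5.3)

Here $x = (x^1, \ldots, x^d)'$, $|x| = \sqrt{(x^1)^2 + (x^2)^2 + \ldots + (x^d)^2}$ and

    $f(y) = \begin{cases} 1 & \text{if } \bar{C} = 0, \\ \sqrt{\bar{C}}\,y\,\dfrac{\cos(\sqrt{\bar{C}}\,y)}{\sin(\sqrt{\bar{C}}\,y)} & \text{if } \bar{C} > 0, \\ \sqrt{-\bar{C}}\,y\,\dfrac{\cosh(\sqrt{-\bar{C}}\,y)}{\sinh(\sqrt{-\bar{C}}\,y)} & \text{if } \bar{C} < 0. \end{cases}$    (5.4)

There is equality in (5.3) when $M$ has constant sectional curvature $\bar{C}$,
and in this case $\Lambda$ has the expression:

    $\Lambda_{rs} = 2E\left\{\frac{1 - f(|\tilde{X}_1|)}{|\tilde{X}_1|^2}\,\tilde{X}_1^r \tilde{X}_1^s + f(|\tilde{X}_1|)\,\delta_{rs}\right\}$, $1 \le r, s \le d$.    (5.5)

$\Lambda$ is positive definite if $\mathrm{supp}(Q) \subseteq B(\mu_I, r_*/2)$.

Proof. See Theorem 2.2, [8].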
From Theorem 3.2 it follows that $\Sigma = 4\,\mathrm{Cov}(Y_1)$ where
$Y_j = \phi(X_j)$, $j = 1, \ldots, n$ are the normal coordinates of the sample
$X_1, \ldots, X_n$ from $Q$. It is nonsingular if $Q \circ \phi^{-1}$ has support
in no smaller dimensional subspace of $\mathbb{R}^d$. That holds if for example
$Q$ has a density with respect to the volume measure on $M$.


3.6. Applications 


In this section we apply the results of the earlier sections to some important
manifolds. We start with the unit sphere $S^d$ in $\mathbb{R}^{d+1}$.


3.6.1. $S^d$


Consider the space of all directions in $\mathbb{R}^{d+1}$ which can be
identified with the unit sphere

    $S^d = \{x \in \mathbb{R}^{d+1} : |x| = 1\}$.

Statistics on $S^2$, often called directional statistics, have been among the
earliest and most widely used statistics on manifolds. (See, e.g., [42], [23],
[35]). Among important applications, we cite paleomagnetism, where one may
detect and/or study the shifting of magnetic poles on earth over geological
times. Another application is the estimation of the direction of a signal.


3.6.1.1. Extrinsic Mean on $S^d$


The inclusion map $i : S^d \to \mathbb{R}^{d+1}$, $i(x) = x$ provides a natural
embedding for $S^d$ into $\mathbb{R}^{d+1}$. The extrinsic mean set of a
probability distribution $Q$ on $S^d$ is then the set $P_{S^d}\tilde\mu$ of
points on $S^d$ closest to $\tilde\mu = \int_{\mathbb{R}^{d+1}} x\,\tilde{Q}(dx)$,
where $\tilde{Q}$ is $Q$ regarded as a probability measure on
$\mathbb{R}^{d+1}$. Note that $\tilde\mu$ is non-focal iff $\tilde\mu \ne 0$ and
then $Q$ has a unique extrinsic mean $\mu_E = \tilde\mu/|\tilde\mu|$.
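
As an illustration of the sample version of this formula, the following is a minimal Python sketch, assuming the data are given as unit vectors stored in tuples. The function name `extrinsic_mean_sphere` and the numerical threshold used to flag the focal case are illustrative choices, not part of the text.

```python
import math

def extrinsic_mean_sphere(points):
    """Project the Euclidean sample mean of unit vectors back onto the sphere.

    Raises ValueError when the Euclidean mean is (numerically) zero, which is
    the focal case where the projection is not unique.
    """
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    norm = math.sqrt(sum(c * c for c in mean))
    if norm < 1e-12:  # hypothetical tolerance for "mean equals zero"
        raise ValueError("Euclidean mean is zero: focal case")
    return [c / norm for c in mean]

# Three orthogonal directions on S^2: the extrinsic mean points along (1, 1, 1).
sample = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
mu_E = extrinsic_mean_sphere(sample)
```

A sample consisting of a point and its antipode has Euclidean mean zero, so the sketch raises an error there instead of returning an arbitrary direction.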


3.6.1.2. Intrinsic Mean on $S^d$


At each $p \in S^d$, endow the tangent space
$T_pS^d = \{v \in \mathbb{R}^{d+1} : v.p = 0\}$ with the metric tensor
$g_p : T_p \times T_p \to \mathbb{R}$ as the restriction of the scalar product
at $p$ of the tangent space of $\mathbb{R}^{d+1}$: $g_p(v_1, v_2) = v_1.v_2$.
The geodesics are the big circles,

    $\gamma_{p,v}(t) = (\cos t|v|)\,p + (\sin t|v|)\,\frac{v}{|v|}$.    (6.1)

The exponential map, $\exp_p : T_pS^d \to S^d$ is

    $\exp_p(v) = \cos(|v|)\,p + \sin(|v|)\,\frac{v}{|v|}$,    (6.2)

and the geodesic distance is

    $d_g(p, q) = \arccos(p.q) \in [0, \pi]$.    (6.3)

This space has constant sectional curvature 1 and injectivity radius $\pi$.
Hence if $Q$ has support in an open ball of radius $\pi/2$, then it has a
unique intrinsic mean in that ball.
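
The formulas for the exponential map and the geodesic distance can be turned into a simple numerical procedure for the sample intrinsic mean. The sketch below is an assumption-laden illustration rather than a method from the text: it uses a standard fixed-point iteration (move the current estimate along the average of the inverse exponential images of the data), starts from the normalized Euclidean mean, and assumes the data lie in a small enough ball for the iteration to converge. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def geodesic_distance(p, q):
    # d_g(p, q) = arccos(p.q), clamped against rounding error
    return math.acos(max(-1.0, min(1.0, dot(p, q))))

def exp_map(p, v):
    # exp_p(v) = cos(|v|) p + sin(|v|) v/|v|
    nv = norm(v)
    if nv < 1e-15:
        return list(p)
    return [math.cos(nv) * pc + math.sin(nv) * vc / nv for pc, vc in zip(p, v)]

def log_map(p, q):
    # Inverse of exp_p, defined for q different from -p.
    c = dot(p, q)
    w = [qc - c * pc for pc, qc in zip(p, q)]  # part of q tangent to the sphere at p
    nw = norm(w)
    if nw < 1e-15:
        return [0.0] * len(p)
    theta = geodesic_distance(p, q)
    return [theta * wc / nw for wc in w]

def sample_intrinsic_mean(points, iterations=100, tol=1e-12):
    n = len(points)
    dim = len(points[0])
    total = [sum(p[i] for p in points) for i in range(dim)]
    mu = [c / norm(total) for c in total]  # start at the extrinsic mean
    for _ in range(iterations):
        logs = [log_map(mu, x) for x in points]
        step = [sum(l[i] for l in logs) / n for i in range(dim)]
        if norm(step) < tol:
            break
        mu = exp_map(mu, step)
    return mu

# Points on one great circle at angles 0, 0.6, 0.6 from the north pole.
q = (math.sin(0.6), 0.0, math.cos(0.6))
mu_I = sample_intrinsic_mean([(0.0, 0.0, 1.0), q, q])
```

For points on a common great circle inside a small ball, the intrinsic mean sits at the average angle (0.4 here), which differs slightly from the direction of the Euclidean mean.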


3.6.2. $\mathbb{R}P^d$


Consider the real projective space $\mathbb{R}P^d$ of all lines through the
origin in $\mathbb{R}^{d+1}$. The elements of $\mathbb{R}P^d$ may be
represented as $[u] = \{-u, u\}$ ($u \in S^d$).


3.6.2.1. Extrinsic Mean on $\mathbb{R}P^d$


$\mathbb{R}P^d$ can be embedded into the space of $k \times k$ real symmetric
matrices $S(k, \mathbb{R})$, $k = d + 1$ via the Veronese-Whitney embedding
$J : \mathbb{R}P^d \to S(k, \mathbb{R})$ which is given by

    $J([u]) = uu' = ((u_i u_j))_{1 \le i,j \le k}$  ($u = (u_1, \ldots, u_k)' \in S^d$).    (6.4)

As a linear subspace of $\mathbb{R}^{k^2}$, $S(k, \mathbb{R})$ has the
Euclidean distance

    $\|A - B\|^2 = \sum_{1 \le i,j \le k} (a_{ij} - b_{ij})^2 = \mathrm{Trace}(A - B)(A - B)'$.    (6.5)

This endows $\mathbb{R}P^d$ with the extrinsic distance $\rho$ given by

    $\rho^2([u], [v]) = \|uu' - vv'\|^2 = 2(1 - (u'v)^2)$.    (6.6)


Let $Q$ be a probability distribution on $\mathbb{R}P^d$ and let $\tilde\mu$ be
the mean of $\tilde{Q} = Q \circ J^{-1}$ considered as a probability measure on
$S(k, \mathbb{R})$. Then $\tilde\mu \in S^+(k, \mathbb{R})$, the space of
$k \times k$ real symmetric nonnegative definite matrices, and the projection
of $\tilde\mu$ into $J(\mathbb{R}P^d)$ is given by the set of all $uu'$ where
$u$ is a unit eigenvector of $\tilde\mu$ corresponding to the largest
eigenvalue. Hence the projection is unique, i.e. $\tilde\mu$ is nonfocal iff
its largest eigenvalue is simple, i.e., if the eigenspace corresponding to the
largest eigenvalue is one dimensional. In that case the extrinsic mean of $Q$
is $[u]$, $u$ being a unit eigenvector in the eigenspace of the largest
eigenvalue.
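
The eigenvector characterization suggests a direct computation for a sample of axes. The following Python sketch is illustrative only: it forms the sample mean of the matrices $uu'$ and approximates its top eigenvector by plain power iteration, which is a choice made here for self-containedness and assumes the largest eigenvalue is simple and the start vector is not orthogonal to its eigenvector. The function names are hypothetical.

```python
import math

def extrinsic_mean_rp(units, iterations=500):
    """Approximate a unit eigenvector for the largest eigenvalue of mean(u u')."""
    n = len(units)
    k = len(units[0])
    # Sample mean of the Veronese-Whitney images u u'.
    M = [[sum(u[i] * u[j] for u in units) / n for j in range(k)] for i in range(k)]
    v = [1.0 / math.sqrt(k)] * k  # assumed start vector
    for _ in range(iterations):
        w = [sum(M[i][j] * v[j] for j in range(k)) for i in range(k)]
        nw = math.sqrt(sum(c * c for c in w))
        v = [c / nw for c in w]
    return v  # represents the axis [v] = {-v, v}

def extrinsic_distance_sq(u, v):
    # rho^2([u], [v]) = 2 (1 - (u'v)^2), as in (6.6)
    d = sum(a * b for a, b in zip(u, v))
    return 2.0 * (1.0 - d * d)

# Two representatives of the same axis plus one orthogonal axis.
axes = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mean_axis = extrinsic_mean_rp(axes)
```

Because $u$ and $-u$ have the same image $uu'$, the two representatives of the first axis reinforce each other here, whereas on the sphere they would cancel.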
3.6.2.2. Intrinsic Mean on $\mathbb{R}P^d$


$\mathbb{R}P^d$ is a complete Riemannian manifold with geodesic distance

    $d_g([p], [q]) = \arccos(|p.q|) \in [0, \pi/2]$.    (6.7)

It has constant sectional curvature 1 and injectivity radius $\pi/2$. Hence if
the support of $Q$ is contained in an open geodesic ball of radius $\pi/4$, it
has a unique intrinsic mean in that ball.


3.6.3. $\Sigma_m^k$


Consider a set of $k$ points in $\mathbb{R}^m$, not all points being the same.
Such a set is called a $k$-ad or a configuration of $k$ landmarks. We will
denote a $k$-ad by the $m \times k$ matrix $x = [x_1 \ldots x_k]$, where $x_i$,
$i = 1, \ldots, k$ are the $k$ landmarks from the object of interest. Assume
$k > m$. The direct similarity shape of the $k$-ad is what remains after we
remove the effects of translation, rotation and scaling. To remove
translation, we subtract the mean $\bar{x} = \frac{1}{k}\sum_{i=1}^k x_i$ from
each landmark to get the centered $k$-ad
$w = [x_1 - \bar{x} \ldots x_k - \bar{x}]$. We remove the effect of scaling by
dividing $w$ by its Euclidean norm to get

    $u = \left[\frac{x_1 - \bar{x}}{\|w\|} \ \ldots\ \frac{x_k - \bar{x}}{\|w\|}\right] = [u_1\, u_2 \ldots u_k]$.    (6.8)

This $u$ is called the preshape of the $k$-ad $x$ and it lies in the unit
sphere $S_m^k$ in the hyperplane

    $H_m^k = \{u \in \mathbb{R}^{km} : \sum_{j=1}^k u_j = 0\}$.    (6.9)

Thus the preshape space $S_m^k$ may be identified with the sphere
$S^{km-m-1}$. Then the shape of the $k$-ad $x$ is the orbit of its preshape
under left multiplication by $m \times m$ rotation matrices. In other words
$\Sigma_m^k = S^{km-m-1}/SO(m)$. The cases of importance are $m = 2, 3$. Next
we turn to the case $m = 2$.
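
The two normalization steps, centering and scaling, can be sketched in a few lines of Python. This is a minimal illustration under the assumption that a $k$-ad is stored as a list of $k$ landmarks, each a length-$m$ tuple (the transpose of the $m \times k$ matrix convention used above); the function name `preshape` is hypothetical.

```python
import math

def preshape(kad):
    """Center a k-ad at its centroid and scale it to unit Euclidean norm."""
    k = len(kad)
    m = len(kad[0])
    centroid = [sum(x[i] for x in kad) / k for i in range(m)]
    w = [[x[i] - centroid[i] for i in range(m)] for x in kad]  # centered k-ad
    size = math.sqrt(sum(c * c for row in w for c in row))
    if size == 0.0:
        raise ValueError("all landmarks coincide, so the preshape is undefined")
    return [[c / size for c in row] for row in w]

# A triangle and a translated, rescaled copy share the same preshape.
triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
moved = [(3.0 * a + 5.0, 3.0 * b - 1.0) for a, b in triangle]
u = preshape(triangle)
v = preshape(moved)
```

Rotation is not removed at this stage; that is what passing to the orbit under $SO(m)$ accomplishes.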


3.6.4. $\Sigma_2^k$


As pointed out in Sections 3.2.2 and 3.6.3, $\Sigma_2^k = S^{2k-3}/SO(2)$. For
a simpler representation, we denote a $k$-ad in the plane by a set of $k$
complex numbers. The preshape of this complex $k$-vector $x$ is
$z = \frac{x - \bar{x}}{\|x - \bar{x}\|}$, $x = (x_1, \ldots, x_k) \in \mathbb{C}^k$,
$\bar{x} = \frac{1}{k}\sum_{j=1}^k x_j$. $z$ lies in the complex sphere

    $S_2^k = \{z \in \mathbb{C}^k : \sum_{j=1}^k |z_j|^2 = 1,\ \sum_{j=1}^k z_j = 0\}$    (6.10)

which may be identified with the real sphere of dimension $2k - 3$. Then the
shape of $x$ can be represented as the orbit

    $\sigma(x) = \sigma(z) = \{e^{i\theta}z : -\pi < \theta \le \pi\}$    (6.11)

and

    $\Sigma_2^k = \{\sigma(z) : z \in S_2^k\}$.    (6.12)

Thus $\Sigma_2^k$ has the structure of the complex projective space
$\mathbb{C}P^{k-2}$ of all complex lines through the origin in
$\mathbb{C}^{k-1}$, an important and well studied manifold in differential
geometry (See [24], pp. 63-65, 97-100, [8]).


3.6.4.1. Extrinsic Mean on $\Sigma_2^k$


$\Sigma_2^k$ can be embedded into $S(k, \mathbb{C})$, the space of $k \times k$
complex Hermitian matrices, via the Veronese-Whitney embedding

    $J : \Sigma_2^k \to S(k, \mathbb{C})$, $J(\sigma(z)) = zz^*$.    (6.13)

$J$ is equivariant under the action of $SU(k)$, the group of $k \times k$
complex matrices $\Gamma$ such that $\Gamma^*\Gamma = I$, $\det(\Gamma) = 1$. To
see this, let $\Gamma \in SU(k)$. Then $\Gamma$ defines a diffeomorphism,

    $\Gamma : \Sigma_2^k \to \Sigma_2^k$, $\Gamma(\sigma(z)) = \sigma(\Gamma(z))$.    (6.14)

The map $\phi_\Gamma$ on $S(k, \mathbb{C})$ defined by

    $\phi_\Gamma(A) = \Gamma A \Gamma^*$    (6.15)

preserves distances and has the property

    $(\phi_\Gamma)^{-1} = \phi_{\Gamma^{-1}}$, $\phi_{\Gamma_1\Gamma_2} = \phi_{\Gamma_1} \circ \phi_{\Gamma_2}$.    (6.16)

That is, (6.15) defines a group homomorphism from $SU(k)$ into a group of
isometries of $S(k, \mathbb{C})$. Finally note that
$J(\Gamma(\sigma(z))) = \phi_\Gamma(J(\sigma(z)))$. Informally, the symmetries
$SU(k)$ of $\Sigma_2^k$ are preserved by the embedding $J$.

$S(k, \mathbb{C})$ is a (real) vector space of dimension $k^2$. It has the
Euclidean distance,

    $\|A - B\|^2 = \sum_{i,j} |a_{ij} - b_{ij}|^2 = \mathrm{Trace}(A - B)^2$.    (6.17)

Thus the extrinsic distance $\rho$ on $\Sigma_2^k$ induced from the
Veronese-Whitney embedding is given by

    $\rho^2(\sigma(x), \sigma(y)) = \|uu^* - vv^*\|^2 = 2(1 - |u^*v|^2)$,    (6.18)

where $x$ and $y$ are two $k$-ads, $u$ and $v$ are their preshapes respectively.
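
To make the invariance built into this distance concrete, here is a small Python sketch using the built-in complex type. It is an illustration under stated assumptions: planar $k$-ads are given as lists of complex numbers, and the names `complex_preshape` and `extrinsic_distance_sq` are hypothetical helpers, not notation from the text.

```python
import cmath
import math

def complex_preshape(landmarks):
    """Center a planar k-ad (list of complex numbers) and scale to unit norm."""
    k = len(landmarks)
    centroid = sum(landmarks) / k
    w = [x - centroid for x in landmarks]
    size = math.sqrt(sum(abs(c) ** 2 for c in w))
    return [c / size for c in w]

def extrinsic_distance_sq(x, y):
    """rho^2 = 2 (1 - |u* v|^2) for the preshapes u, v of the k-ads x, y."""
    u = complex_preshape(x)
    v = complex_preshape(y)
    inner = sum(a.conjugate() * b for a, b in zip(u, v))  # u* v
    return 2.0 * (1.0 - abs(inner) ** 2)

quad = [0j, 1 + 0j, 1 + 1j, 2j]
# Rotate by 0.7 radians, scale by 2 and translate: the shape is unchanged.
same_shape = [2.0 * cmath.exp(0.7j) * c + (3 - 1j) for c in quad]
square = [0j, 1 + 0j, 1 + 1j, 1j]

d_same = extrinsic_distance_sq(quad, same_shape)
d_diff = extrinsic_distance_sq(quad, square)
```

Rotation multiplies the preshape by a unit complex number, which leaves $|u^*v|$ unchanged; this is why no explicit alignment step is needed in the extrinsic distance.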

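Let $Q$ be a probability distribution on $\Sigma_2^k$ and let $\tilde\mu$ be
the mean of $\tilde{Q} = Q \circ J^{-1}$, regarded as a probability measure on
$\mathbb{C}^{k^2}$. Then $\tilde\mu \in S_+(k, \mathbb{C})$: the space of
$k \times k$ complex positive semidefinite matrices. Its projection into
$J(\Sigma_2^k)$ is given by $P(\tilde\mu) = \{uu^*\}$ where $u$ is a unit
eigenvector of $\tilde\mu$ corresponding to its largest eigenvalue. The
projection is unique, i.e. $\tilde\mu$ is nonfocal, and $Q$ has a unique
extrinsic mean $\mu_E$, iff the eigenspace for the largest eigenvalue of
$\tilde\mu$ is (complex) one dimensional, and then $\mu_E = \sigma(u)$,
$u (\ne 0) \in$ eigenspace of the largest eigenvalue of $\tilde\mu$. Let
$X_1, \ldots, X_n$ be an iid sample from $Q$. If $\tilde\mu$ is nonfocal, the
sample extrinsic mean $\mu_{nE}$ is a consistent estimator of $\mu_E$ and
$J(\mu_{nE})$ has an asymptotic Gaussian distribution on the tangent space
$T_{P(\tilde\mu)}J(\Sigma_2^k)$ (see Section 3.4),

    $\sqrt{n}\,[J(\mu_{nE}) - J(\mu_E)] = \sqrt{n}\, d_{\tilde\mu}P(\bar{\tilde{X}} - \tilde\mu) + o_P(1) \xrightarrow{\mathcal{L}} N(0, \Sigma)$.    (6.19)

Here $\tilde{X}_j = J(X_j)$, $j = 1, \ldots, n$. In (6.19),
$d_{\tilde\mu}P(\bar{\tilde{X}} - \tilde\mu)$ has coordinates

    $\bar{T}(\tilde\mu) = \left(\sqrt{2}\,\mathrm{Re}(U_a^*\bar{\tilde{X}}U_k),\ \sqrt{2}\,\mathrm{Im}(U_a^*\bar{\tilde{X}}U_k)\right)_{a=2}^{k-1}$    (6.20)

with respect to the basis

    $\{(\lambda_k - \lambda_a)^{-1}\,U v_k^a U^*,\ (\lambda_k - \lambda_a)^{-1}\,U w_k^a U^*\}_{a=2}^{k-1}$    (6.21)

for $T_{P(\tilde\mu)}J(\Sigma_2^k)$ (see Section 3.3, [7]). Here
$U = [U_1 \ldots U_k] \in SO(k)$ is such that
$U^*\tilde\mu U = D \equiv \mathrm{Diag}(\lambda_1, \ldots, \lambda_k)$,
$\lambda_1 \le \cdots \le \lambda_{k-1} < \lambda_k$ being the eigenvalues of
$\tilde\mu$. $\{v_b^a : 1 \le a \le b \le k\}$ and
$\{w_b^a : 1 \le a < b \le k\}$ is the canonical orthonormal basis frame for
$S(k, \mathbb{C})$, defined as

    $v_b^a = \begin{cases} \frac{1}{\sqrt{2}}(e_a e_b^t + e_b e_a^t), & a < b, \\ e_a e_a^t, & a = b, \end{cases}$
    $w_b^a = \frac{i}{\sqrt{2}}(e_a e_b^t - e_b e_a^t)$, $a < b$,    (6.22)

where $\{e_a : 1 \le a \le k\}$ is the standard canonical basis for
$\mathbb{R}^k$.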
Given two independent samples $X_1, \ldots, X_n$ iid $Q_1$ and
$Y_1, \ldots, Y_m$ iid $Q_2$ on $\Sigma_2^k$, we may like to test if
$Q_1 = Q_2$ by comparing their extrinsic mean shapes. Let $\mu_{iE}$ denote the
extrinsic mean of $Q_i$ and let $\mu_i$ be the mean of $Q_i \circ J^{-1}$,
$i = 1, 2$. Then $\mu_{iE} = J^{-1}P(\mu_i)$, and we wish to test
$H_0 : P(\mu_1) = P(\mu_2)$. Let $\tilde{X}_j = J(X_j)$, $j = 1, \ldots, n$ and
$\tilde{Y}_j = J(Y_j)$, $j = 1, \ldots, m$. Let $T_j$, $S_j$ denote the
asymptotic coordinates for $\tilde{X}_j$, $\tilde{Y}_j$ respectively in
$T_{P(\hat\mu)}J(\Sigma_2^k)$ as defined in (6.20). Here
$\hat\mu = \frac{n\bar{\tilde{X}} + m\bar{\tilde{Y}}}{m + n}$ is the pooled
sample mean. We use the two sample test statistic

    $T_{nm} = (\bar{T} - \bar{S})'\left(\frac{1}{n}\hat\Sigma_1 + \frac{1}{m}\hat\Sigma_2\right)^{-1}(\bar{T} - \bar{S})$.    (6.23)

Here $\hat\Sigma_1$, $\hat\Sigma_2$ denote the sample covariances of $T_j$,
$S_j$ respectively. Under $H_0$,
$T_{nm} \xrightarrow{\mathcal{L}} \mathcal{X}^2_{2k-4}$ (see Section 3.4, [7]).
Hence given level $\alpha$, we reject $H_0$ if
$T_{nm} > \mathcal{X}^2_{2k-4}(1 - \alpha)$.


3.6.4.2. Intrinsic Mean on $\Sigma_2^k$


Identified with $\mathbb{C}P^{k-2}$, $\Sigma_2^k$ is a complete connected
Riemannian manifold. It has all sectional curvatures bounded between 1 and 4
and injectivity radius of $\pi/2$ (see [24], pp. 97-100, 134). Hence if
$\mathrm{supp}(Q) \subseteq B(p, \pi/4)$, $p \in \Sigma_2^k$, it has a unique
intrinsic mean $\mu_I$ in the ball.

Let $X_1, \ldots, X_n$ be iid $Q$ and let $\mu_{nI}$ denote the sample
intrinsic mean. Under the hypothesis of Theorem 3.2,

    $\sqrt{n}\,(\phi(\mu_{nI}) - \phi(\mu_I)) \xrightarrow{\mathcal{L}} N(0, \Lambda^{-1}\Sigma\Lambda^{-1})$.    (6.24)

However Theorem 3.2 does not provide an analytic computation of $\Lambda$,
since $\Sigma_2^k$ does not have constant sectional curvature. Proposition 3.4
below gives the precise expression for $\Lambda$. It also relaxes the support
condition required for $\Lambda$ to be positive definite.


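Proposition 3.4. With respect to normal coordinates,
$\phi : B(p, \pi/4) \to \mathbb{C}^{k-2}\ (\approx \mathbb{R}^{2k-4})$, $\Lambda$
as defined in Theorem 3.1 has the following expression:

    $\Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{12}' & \Lambda_{22} \end{bmatrix}$    (6.25)

where for $1 \le r, s \le k - 2$,

    $(\Lambda_{11})_{rs} = 2E\left[d_1\cot(d_1)\,\delta_{rs} - \frac{\{1 - d_1\cot(d_1)\}}{d_1^2}(\mathrm{Re}\,\tilde{X}_{1,r})(\mathrm{Re}\,\tilde{X}_{1,s}) + \frac{\tan(d_1)}{d_1}(\mathrm{Im}\,\tilde{X}_{1,r})(\mathrm{Im}\,\tilde{X}_{1,s})\right]$,

    $(\Lambda_{22})_{rs} = 2E\left[d_1\cot(d_1)\,\delta_{rs} - \frac{\{1 - d_1\cot(d_1)\}}{d_1^2}(\mathrm{Im}\,\tilde{X}_{1,r})(\mathrm{Im}\,\tilde{X}_{1,s}) + \frac{\tan(d_1)}{d_1}(\mathrm{Re}\,\tilde{X}_{1,r})(\mathrm{Re}\,\tilde{X}_{1,s})\right]$,

    $(\Lambda_{12})_{rs} = -2E\left[\frac{\{1 - d_1\cot(d_1)\}}{d_1^2}(\mathrm{Re}\,\tilde{X}_{1,r})(\mathrm{Im}\,\tilde{X}_{1,s}) + \frac{\tan(d_1)}{d_1}(\mathrm{Im}\,\tilde{X}_{1,r})(\mathrm{Re}\,\tilde{X}_{1,s})\right]$,

where $d_1 = d_g(X_1, \mu_I)$ and
$\tilde{X}_j \equiv (\tilde{X}_{j,1}, \ldots, \tilde{X}_{j,k-2}) = \phi(X_j)$,
$j = 1, \ldots, n$. $\Lambda$ is positive definite if
$\mathrm{supp}(Q) \subseteq B(\mu_I, 0.37\pi)$.

Proof. See Theorem 3.1, [8].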


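Note that with respect to a chosen orthonormal basis $\{v_1, \ldots, v_{k-2}\}$
for $T_{\mu_I}\Sigma_2^k$, $\phi$ has the expression

    $\phi(m) = (\tilde{m}_1, \ldots, \tilde{m}_{k-2})'$

where

    $\tilde{m}_j = \frac{r}{\sin r}\, e^{i\theta}\, \bar{v}_j' z$, $r = d_g(m, \mu_I) = \arccos(|z_0'\bar{z}|)$, $e^{i\theta} = \frac{z_0'\bar{z}}{|z_0'\bar{z}|}$.

Here $z$, $z_0$ are the preshapes of $m$, $\mu_I$ respectively (see Section 3,
[8]).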
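Given two independent samples $X_1, \ldots, X_n$ iid $Q_1$ and
$Y_1, \ldots, Y_m$ iid $Q_2$, one may test if $Q_1$ and $Q_2$ have the same
intrinsic mean $\mu_I$. The test statistic used is

    $T_{nm} = (n + m)\,(\hat\phi(\mu_{nI}) - \hat\phi(\mu_{mI}))'\,\hat\Sigma^{-1}\,(\hat\phi(\mu_{nI}) - \hat\phi(\mu_{mI}))$.    (6.27)

Here $\mu_{nI}$ and $\mu_{mI}$ are the sample intrinsic means for the $X$ and
$Y$ samples respectively and $\hat\mu$ is the pooled sample intrinsic mean.
Then $\hat\phi = \exp_{\hat\mu}^{-1}$ gives normal coordinates on the tangent
space at $\hat\mu$, and
$\hat\Sigma = (m + n)\left(\frac{1}{n}\hat\Lambda_1^{-1}\hat\Sigma_1\hat\Lambda_1^{-1} + \frac{1}{m}\hat\Lambda_2^{-1}\hat\Sigma_2\hat\Lambda_2^{-1}\right)$,
where $(\Lambda_1, \Sigma_1)$ and $(\Lambda_2, \Sigma_2)$ are the parameters in
the asymptotic distribution of $\sqrt{n}\,(\phi(\mu_{nI}) - \phi(\mu_I))$ and
$\sqrt{m}\,(\phi(\mu_{mI}) - \phi(\mu_I))$ respectively, as defined in Theorem
3.1, and $(\hat\Lambda_1, \hat\Sigma_1)$ and $(\hat\Lambda_2, \hat\Sigma_2)$ are
consistent sample estimates. Assuming $H_0$ to be true,
$T_{nm} \xrightarrow{\mathcal{L}} \mathcal{X}^2_{2k-4}$ (see Section 4.1, [7]).
Hence we reject $H_0$ at asymptotic level $\alpha$ if
$T_{nm} > \mathcal{X}^2_{2k-4}(1 - \alpha)$.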


3.6.5. $R\Sigma_m^k$


For $m > 2$, the direct similarity shape space $\Sigma_m^k$ fails to be a
manifold. That is because the action of $SO(m)$ is not in general free (see,
e.g., [30] and [39]). To avoid that one may consider the shape of only those
$k$-ads whose preshapes have rank at least $m - 1$. This subset is a manifold
but not complete (in its geodesic distance). Alternatively one may also remove
the effect of reflection and redefine shape of a $k$-ad $x$ as

    $\sigma(x) = \sigma(z) = \{Az : A \in O(m)\}$    (6.28)

where $z$ is the preshape. Then $R\Sigma_m^k$ is the space of all such shapes
where rank of $z$ is $m$. In other words

    $R\Sigma_m^k = \{\sigma(z) : z \in S_m^k,\ \mathrm{rank}(z) = m\}$.    (6.29)

This is a manifold. It has been shown that the map

    $J : R\Sigma_m^k \to S(k, \mathbb{R})$, $J(\sigma(z)) = z'z$    (6.30)

is an embedding of the reflection shape space into $S(k, \mathbb{R})$ (see [2],
[1], and [20]) and is $H$-equivariant where $H = O(k)$ acts on the right:
$A\sigma(z) = \sigma(zA')$, $A \in O(k)$.

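Let $Q$ be a probability distribution on $R\Sigma_m^k$ and let $\tilde\mu$ be
the mean of $Q \circ J^{-1}$ regarded as a probability measure on
$S(k, \mathbb{R})$. Then $\tilde\mu$ is positive semi-definite with rank at
least $m$. Let $\tilde\mu = UDU'$ be the singular value decomposition of
$\tilde\mu$, where $D = \mathrm{Diag}(\lambda_1, \ldots, \lambda_k)$ consists of
ordered eigenvalues
$\lambda_1 \ge \ldots \ge \lambda_m \ge \ldots \ge \lambda_k \ge 0$ of
$\tilde\mu$, and $U = [U_1 \ldots U_k]$ is a matrix in $SO(k)$ whose columns are
the corresponding orthonormal eigenvectors. It has been shown in [6] that the
extrinsic mean reflection shape set of $Q$ has the following expression:

    $\left\{\mu \in R\Sigma_m^k : J(\mu) = \sum_{j=1}^m \left(\lambda_j - \bar\lambda + \frac{1}{m}\right) U_j U_j'\right\}$    (6.31)

where $\bar\lambda = \frac{1}{m}\sum_{j=1}^m \lambda_j$. The set in (6.31) is a
singleton, and hence $Q$ has a unique mean reflection shape $\mu$ iff
$\lambda_m > \lambda_{m+1}$. Then $\mu = \sigma(u)$ where

    $u = \left[\sqrt{\lambda_1 - \bar\lambda + \frac{1}{m}}\;U_1, \ldots, \sqrt{\lambda_m - \bar\lambda + \frac{1}{m}}\;U_m\right]'$.    (6.32)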
3.6.6. $A\Sigma_m^k$


Let $z$ be a centered $k$-ad in $H(m, k)$, and let $\sigma(z)$ denote its
affine shape, as defined in Section 3.2.4. Consider the map

    $J : A\Sigma_m^k \to S(k, \mathbb{R})$, $J(\sigma(z)) = P \equiv FF'$    (6.33)

where $F = [f_1\, f_2 \ldots f_m]$ is an orthonormal basis for the row space of
$z$. This is an embedding of $A\Sigma_m^k$ into $S(k, \mathbb{R})$ with the
image

    $J(A\Sigma_m^k) = \{A \in S(k, \mathbb{R}) : A^2 = A,\ \mathrm{Trace}(A) = m,\ A\mathbf{1} = 0\}$.    (6.34)

It is equivariant under the action of $O(k)$ (see [18]).


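Proposition 3.5. Let $Q$ be a probability distribution on $A\Sigma_m^k$ and let
$\tilde\mu$ be the mean of $Q \circ J^{-1}$ in $S(k, \mathbb{R})$. The
projection of $\tilde\mu$ into $J(A\Sigma_m^k)$ is given by

    $P(\tilde\mu) = \left\{\sum_{j=1}^m U_j U_j'\right\}$    (6.35)

where $U = [U_1 \ldots U_k] \in SO(k)$ is such that
$U'\tilde\mu U = D \equiv \mathrm{Diag}(\lambda_1, \ldots, \lambda_k)$,
$\lambda_1 \ge \ldots \ge \lambda_m \ge \ldots \ge \lambda_k$. $\tilde\mu$ is
nonfocal and $Q$ has a unique extrinsic mean $\mu_E$ if
$\lambda_m > \lambda_{m+1}$. Then $\mu_E = \sigma(F')$ where
$F = [U_1 \ldots U_m]$.

Proof. See [41].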


3.6.7. $P_0\Sigma_m^k$


Consider the diffeomorphism between $P_I\Sigma_m^k$ and
$(\mathbb{R}P^m)^{k-m-2}$ as defined in Section 3.2.5. Using that one can embed
$P_I\Sigma_m^k$ into $S(m + 1, \mathbb{R})^{k-m-2}$ via the Veronese-Whitney
embedding of Section 3.6.2 and perform extrinsic analysis in a dense open
subset of $P_0\Sigma_m^k$.


3.7. Examples 


3.7.1. Example 1: Gorilla Skulls 


To test the difference in the shapes of skulls of male and female gorillas, 
eight landmarks were chosen on the midline plane of the skulls of 29 male 
and 30 female gorillas. The data can be found in [21], pp. 317-318. Thus 
we have two iid samples in $\Sigma_2^k$, $k = 8$. The sample extrinsic mean
shapes for the female and male samples are denoted by $\hat\mu_{1E}$ and
$\hat\mu_{2E}$ where

    $\hat\mu_{1E} = \sigma[-0.3586 + 0.3425i,\ 0.3421 - 0.2943i,\ 0.0851 - 0.3519i,$
        $-0.0085 - 0.2388i,\ -0.1675 + 0.0021i,\ -0.2766 + 0.3050i,$
        $0.0587 + 0.2353i,\ 0.3253]$,

    $\hat\mu_{2E} = \sigma[-0.3692 + 0.3386i,\ 0.3548 - 0.2641i,\ 0.1246 - 0.3320i,$
        $0.0245 - 0.2562i,\ -0.1792 - 0.0179i,\ -0.3016 + 0.3072i,$
        $0.0438 + 0.2245i,\ 0.3022]$.

Fig. 3.1. (a) and (b) show 8 landmarks from skulls of 30 female and 29 male gorillas,
respectively, along with the mean shapes. * correspond to the mean shapes' landmarks.
(Panel titles: (a) Female sample preshapes; (b) Male sample preshapes.)
Fig. 3.2. The sample extrinsic means for the 2 groups along with the pooled sample
mean, corresponding to Figure 3.1. (Legend: Female, Male, Pooled mean.)


The corresponding intrinsic mean shapes are denoted by $\hat\mu_{1I}$ and
$\hat\mu_{2I}$. They are very close to the extrinsic means
($d_g(\hat\mu_{1E}, \hat\mu_{1I}) = 5.5395 \times 10^{-7}$,
$d_g(\hat\mu_{2E}, \hat\mu_{2I}) = 1.9609 \times 10^{-6}$). Figure 3.1 shows the
preshapes of the sample $k$-ads along with that of the extrinsic mean. The
sample preshapes have been rotated appropriately so as to minimize the
Euclidean distance from the mean preshape. Figure 3.2 shows the preshapes of
the extrinsic means for the two samples along with that of the pooled sample
extrinsic mean. In [7], nonparametric two sample tests are performed to compare
the mean shapes. The statistics (6.23) and (6.27) yield the following values:

    Extrinsic: $T_{nm} = 392.6$, p-value $= P(\mathcal{X}^2_{12} > 392.6) < 10^{-16}$.
    Intrinsic: $T_{nm} = 391.63$, p-value $= P(\mathcal{X}^2_{12} > 391.63) < 10^{-16}$.

A parametric F-test ([21], pp. 154) yields $F = 26.47$, p-value
$= P(F_{12,46} > 26.47) = 0.0001$. A parametric (Normal) model for Bookstein
coordinates leads to the Hotelling's $T^2$ test ([21], pp. 170-172), which
yields the p-value 0.0001.
3.7.2. Example 2: Schizophrenic Children


In this example from [15], 13 landmarks are recorded on a midsagittal 
two-dimensional slice from a Magnetic Resonance brain scan of each of 14 
schizophrenic children and 14 normal children. In [7], nonparametric two 
sample tests are performed to compare the extrinsic and intrinsic mean 
shapes of the two samples. The values of the two-sample test statistics 
(6.23), (6.27), along with the p-values are as follows. 


    Extrinsic: $T_{nm} = 95.5476$, p-value $= P(\mathcal{X}^2_{22} > 95.5476) = 3.8 \times 10^{-11}$.
    Intrinsic: $T_{nm} = 95.4587$, p-value $= P(\mathcal{X}^2_{22} > 95.4587) = 3.97 \times 10^{-11}$.

The value of the likelihood ratio test statistic, using the so-called offset
normal shape distribution ([21], pp. 145-146) is $-2\log\Lambda = 43.124$,
p-value $= P(\mathcal{X}^2_{22} > 43.124) = 0.005$. The corresponding values of
Goodall's F-statistic and Bookstein's Monte Carlo test ([21], pp. 145-146) are
$F_{22,572} = 1.89$, p-value $= P(F_{22,572} > 1.89) = 0.01$. The p-value for
Bookstein's test $= 0.04$.


3.7.3. Example 3: Glaucoma Detection 


To detect any shape change due to Glaucoma, 3D images of the Optic 
Nerve Head (ONH) of both eyes of 12 rhesus monkeys were collected. One 
of the eyes was treated while the other was left untreated. 5 landmarks 
were recorded on each eye and their reflection shape was considered in 
$R\Sigma_3^k$, $k = 5$. For details on landmark registration, see [17]. The landmark
coordinates can be found in [11]. Figure 3.3 shows the preshapes of the 
sample k-ads along with that of the mean shapes. The sample points have 
been rotated and (or) reflected so as to minimize their Euclidean distance 
from the mean preshapes. Figure 3.4 shows the preshapes of the mean 
shapes for the two eyes along with that of the pooled sample mean shape. 
In [1], 4 landmarks are selected and the sample mean shapes of the two 
eyes are compared. Five local coordinates are used in the neighborhood of 
the mean to compute Bonferroni type Bootstrap Confidence Intervals for 
the difference between the local reflection similarity shape coordinates of 
the paired glaucomatous versus control eye (see Section 6.1, [1] for details). 
It is found that the means are different at 1% level of significance. 
Fig. 3.3. (a) and (b) show 5 landmarks from treated and untreated eyes of 12 monkeys,
respectively, along with the mean shapes. * correspond to the mean shapes' landmarks.
(Panel titles: landmarks for treated eyes along with the extrinsic mean;
landmarks for untreated eyes along with the extrinsic mean.)

Fig. 3.4. The sample means for the 2 eyes along with the pooled sample mean,
corresponding to Figure 3.3. (Legend: Untreated mean, Treated, Pooled.)
This article advocates the viewpoint that reinforcement learning algo-
rithms, primarily meant for approximate dynamic programming, can 
also be cast as a technique for estimating stationary averages and sta- 
tionary distributions of Markov chains. In this role, they lie somewhere 
between standard deterministic numerical schemes and Markov chain 
Monte Carlo, and capture a trade-off between the advantages of either — 
lower per iterate computation than the former and lower variance than 
the latter. Issues arising from the ‘curse of dimensionality’ and conver- 
gence rate are also discussed. 


4.1. Introduction 


The genealogy of reinforcement learning goes back to mathematical psy- 
chology ([16], [17], [20], [21], [37]). The current excitement in the field, 
however, is spurred by its more recent application to dynamic programming, 
originating in the twin disciplines of machine learning and control engineer- 
ing ([8], [34], [36]). A somewhat simplistic but nevertheless fairly accurate 
view of these schemes is that they replace a classical recursive scheme for 
dynamic programming by a stochastic approximation based incremental 
scheme, which exploits the averaging properties of stochastic approxima- 
tion in order to do away with the conditional expectation operator intrinsic 
to the former. In particular, these schemes can be used for simulation based 
or online, possibly approximate, solution of dynamic programmes. 
In this article, we advocate a somewhat different, albeit related, appli-
cation for this paradigm, viz., to solve two classical problems for Markov 
chains: the problem of estimating the stationary expectation of a prescribed 
function of an ergodic Markov chain, and that of estimating stationary dis- 
tribution of such a chain. The former problem is the linear, or ‘policy 
evaluation’ variant of the learning algorithm for average cost dynamic pro- 
gramming equation ([1], [26], [38], [39]). The latter in turn has been of
great importance in queuing networks (see, e.g., [35]) and more recently, 
in the celebrated ‘PageRank’ scheme for ranking of web sites ([27]). The
scheme proposed here is a variant of a general scheme proposed recently 
in [5], [11] for estimating the Perron-Frobenius eigenvector of a nonnegative 
matrix. See [19] for an earlier effort in this direction. 


Analogous ideas have also been proposed in [9] in a related context. 


The article is organized as follows: As stochastic approximation theory 
forms the essential backdrop for this development, the next section is de- 
voted to recapitulating the essentials thereof that are relevant here. Section 
4.3 recalls the relevant results of [1], [26], [39] in the context of the prob- 
lem of estimating the stationary expectation of a prescribed function of an 
ergodic Markov chain. A ‘function approximation’ variant of this, which 
addresses the curse of dimensionality by adding another layer of approx- 
imation, is described in Section 4.4. Section 4.5 first recalls the relevant 
developments of [5], [11] and then highlights their application to estimating 
stationary distributions of ergodic Markov chains. Section 4.6 discusses the 
issues that affect the speed of such algorithms and suggests some techniques 
for accelerating their convergence. These draw upon ideas from [2] and 
[13]. Section 4.7 concludes with brief pointers to some potential research 
directions. 


4.2. Stochastic Approximation 


The archetypical stochastic approximation algorithm is the $d$-dimensional
recursion

    $x_{n+1} = x_n + a(n)[h(x_n, Y_n) + M_{n+1} + \varepsilon_n]$,    (4.1)

where, for $\mathcal{F}_n \overset{\mathrm{def}}{=} \sigma(x_m, Y_m, M_m; m \le n)$,
$n \ge 0$, the following hold:

• $(M_n, \mathcal{F}_n)$, $n \ge 0$, is a martingale difference sequence, i.e.,
  $\{M_n\}$ is an $\{\mathcal{F}_n\}$-adapted integrable process satisfying

      $E[M_{n+1} | \mathcal{F}_n] = 0 \ \ \forall n$.

  Furthermore, it satisfies

      $E[\|M_{n+1}\|^2 | \mathcal{F}_n] \le K(1 + \|x_n\|^2) \ \ \forall n$

  for some $K > 0$.

• $\{Y_n\}$ is a process taking values in a finite state space $S$, satisfying

      $P(Y_{n+1} = y | \mathcal{F}_n) = q_{x_n}(y | Y_n)$

  for a continuously parametrized family of transition probability functions
  $x \to q_x(\cdot|\cdot)$ on $S$. For each fixed $x$, it is a Markov chain with
  transition matrix $q_x(\cdot|\cdot)$, assumed to be irreducible. In
  particular, it then has a unique stationary distribution $\nu_x$.

• $\{\varepsilon_n\}$ is a random sequence satisfying $\varepsilon_n \to 0$ a.s.

• $\{a(n)\}$ are positive step sizes satisfying

      $\sum_n a(n) = \infty$, $\sum_n a(n)^2 < \infty$.    (4.2)

• $h : \mathbb{R}^d \times S \to \mathbb{R}^d$ is Lipschitz in the first argument.

Variations and generalizations of these conditions are possible, but the above
will suffice for our purposes. The ‘o.d.e.’ approach to the analysis of
stochastic approximation, which goes back to [18], [28], views (4.1) as a noisy
discretization of the ordinary differential equation (o.d.e. for short)

    $\dot{x}(t) = \bar{h}(x(t))$.    (4.3)

Here $\bar{h}(x) \overset{\mathrm{def}}{=} \sum_{i \in S} h(x, i)\nu_x(i)$ will
be Lipschitz under our hypotheses. This ensures the well-posedness of (4.3).
Recall that a set $A$ is invariant for (4.3) if any trajectory $x(t)$,
$t \in \mathbb{R}$, of (4.3) which is in $A$ at time $t = 0$ remains in $A$ for
all $t \in \mathbb{R}$. It is said to be internally chain transitive if in
addition, for any $x, y \in A$, and $\epsilon, T > 0$, there exist $n \ge 2$ and
$x = x_1, x_2, \ldots, x_n = y$ in $A$ such that the trajectory of (4.3)
initiated at $x_i$ intersects the open $\epsilon$-ball centered at $x_{i+1}$ at
some time $t_i \ge T$ for all $1 \le i < n$. Suppose

    $\sup_n \|x_n\| < \infty$ a.s.    (4.4)
n 
Then a result due to Benaim ([6]) states that almost surely, $x_n$ converges
to a compact connected internally chain transitive invariant set of (4.3).
In case the only such sets are the equilibria of (4.3) (i.e., points where
$\bar{h}$ vanishes), then $x_n$ converges to the set of equilibria of (4.3)
a.s. If these are finitely many, then a.s. $x_n$ converges to some possibly
sample path dependent equilibrium. In particular, when there is only one such,
say $x^*$, $x_n \to x^*$ a.s. This is the situation of greatest interest in
most algorithmic applications.
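
A toy instance may help fix ideas. The sketch below is an assumed, deliberately simplified special case of (4.1): the dimension is one, there is no Markov noise $Y_n$, the error term $\varepsilon_n$ is zero, the martingale noise is i.i.d. Gaussian, and the step sizes are $a(n) = 1/(n+1)$, which satisfy (4.2). With $h(x) = -(x - 2)$ the limiting o.d.e. has the single equilibrium $x^* = 2$. The function name and parameters are illustrative.

```python
import random

def stochastic_approximation(h, x0, steps, noise_std=1.0, seed=0):
    """Iterate x_{n+1} = x_n + a(n) [h(x_n) + M_{n+1}] with a(n) = 1/(n+1)."""
    rng = random.Random(seed)
    x = x0
    for n in range(steps):
        a_n = 1.0 / (n + 1)                         # sum a(n) diverges, sum a(n)^2 converges
        martingale_noise = rng.gauss(0.0, noise_std)  # zero-mean noise M_{n+1}
        x = x + a_n * (h(x) + martingale_noise)
    return x

# h vanishes only at x = 2, the unique equilibrium of dx/dt = h(x).
x_final = stochastic_approximation(lambda x: -(x - 2.0), x0=10.0, steps=20000)
```

With this particular step size the iterate is just a running average of noisy observations of 2, so the effect of the poor starting point is averaged out along with the noise.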

The ‘stability’ condition (4.4) usually needs to be separately established.
There are several criteria for the purpose. One such, adapted from [12], is
the following: Suppose the limit

    $h_\infty(x) \overset{\mathrm{def}}{=} \lim_{a \uparrow \infty} \frac{\bar{h}(ax)}{a}$    (4.5)

exists for all $x$, and the o.d.e.

    $\dot{x}(t) = h_\infty(x(t))$    (4.6)

has the origin as the globally asymptotically stable equilibrium. Then (4.4)
holds. (Note that $h_\infty$ will perforce be Lipschitz with the same Lipschitz
constant as $\bar{h}$ and therefore the well-posedness of (4.6) is not an
issue.) This is a slight variant of Theorem 2.1 of [12] and can be proved
analogously.

Let $\Gamma(i)$ denote the $d \times d$ matrix with all entries zero except the
$(i, i)$-th entry, which is one. A special case of the foregoing is the
iteration

    $x_{n+1} = x_n + a(n)\Gamma(X_n)[h(x_n, Y_n) + M_{n+1} + \varepsilon_n]$,

where $\{X_n\}$ is an irreducible Markov chain on the index set
$\{1, 2, \ldots, d\}$. This is a special case of ‘asynchronous stochastic
approximation’ studied in [10]. It corresponds to the situation when only one
component of the vector $x_n$, viz., the $X_n$-th, is updated at time $n$ for
all $n$, in accordance with the above dynamics. The preceding theory then says
that the o.d.e. limit will change to

    $\dot{x}(t) = \Lambda\bar{h}(x(t))$.

Here $\Lambda$ is a diagonal matrix whose $i$th diagonal element equals the
stationary probability of $i$ for the chain $\{X_n\}$. It thus captures the
effect of different components being updated with different relative
frequencies. While this change does not matter for some dynamics (in the sense
that it does not alter their asymptotic behaviour), it is undesirable for some
others. One way of getting around this ([10]) is to replace the step size
$a(n)$ by $a(\nu(X_n, n))$, where
$\nu(i, n) \overset{\mathrm{def}}{=} \sum_{m=0}^n I\{X_m = i\}$ is the local
clock at $i$: the number of times the $i$th component got updated till time
$n$. Under some additional hypotheses on $\{a(n)\}$, this ensures that
$\Lambda$ above gets replaced by $\frac{1}{d}$ times the identity matrix,
whereby the above o.d.e. has the same qualitative behaviour as (4.3), but with
a change of time-scale. See [10] for details.
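
The local clock idea can be illustrated with a toy asynchronous iteration. The sketch below is a simplified assumption, not the scheme of [10]: the component to update is drawn i.i.d. from fixed probabilities (a trivial irreducible chain), each component has its own scalar drift toward a target value, the noise is i.i.d. Gaussian, and the step size is $a(\nu) = 1/\nu$ evaluated at the local clock. All names are hypothetical.

```python
import random

def asynchronous_sa(targets, probs, steps, use_local_clock=True, seed=1):
    """Update one randomly chosen component per step toward its target."""
    rng = random.Random(seed)
    d = len(targets)
    x = [0.0] * d
    local_clock = [0] * d  # nu(i, n): number of updates of component i so far
    for n in range(steps):
        i = rng.choices(range(d), weights=probs)[0]
        local_clock[i] += 1
        # Step size driven by the local clock, or by the global clock for comparison.
        step = 1.0 / local_clock[i] if use_local_clock else 1.0 / (n + 1)
        noisy_drift = targets[i] - x[i] + rng.gauss(0.0, 1.0)
        x[i] += step * noisy_drift
    return x, local_clock

x_async, clock = asynchronous_sa([1.0, -2.0, 3.0], [0.7, 0.2, 0.1], steps=30000)
```

With the local clock, each component behaves as if it were running its own stochastic approximation at its own pace, so rarely visited components are not penalized by step sizes that have already shrunk on the global time scale.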


4.3. Estimating Stationary Averages 


The problem of interest here is the following: Given an irreducible Markov
chain $\{X_n\}$ on a finite state space $S$ ($= \{1,2,\ldots,s\}$, say) with transition
probability matrix $P = [[p(j|i)]]_{i,j \in S}$, and the unique stationary
distribution $\eta$, we want to estimate the stationary average
$\beta = \sum_{i \in S} f(i)\eta(i)$ for a prescribed $f : S \to \mathbb{R}$. The standard
Monte Carlo approach would be to use the sample average


\[ \frac{1}{N}\sum_{m=1}^{N} f(X_m) \tag{4.7} \]


for large $N$ as the estimate. This is justified by the strong law of large
numbers for Markov chains, which states that (4.7) converges a.s. to $\beta$ as
$N \uparrow \infty$. Even in cases where $\eta$ is known, there may be strong motivation
for going for a Markov chain with stationary distribution $\eta$ instead of i.i.d.
random variables with law $\eta$: in typical applications, the latter are hard to
simulate, but the Markov chain, which usually has a simple local transition
rule, is not. This is the case, e.g., in many statistical physics applications.
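
As a minimal sketch of the sample average (4.7), the following compares it with the exact stationary average on a small chain. The 3-state transition matrix, the function `f`, the seed and the run length are illustrative assumptions, not taken from the text; `P[i, j]` stores $p(j|i)$.

```python
import numpy as np

def stationary_distribution(P):
    """Solve eta^T P = eta^T with sum(eta) = 1 for an irreducible chain."""
    s = P.shape[0]
    A = np.vstack([P.T - np.eye(s), np.ones((1, s))])
    b = np.zeros(s + 1)
    b[-1] = 1.0
    eta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return eta

def sample_average(P, f, N, seed=0, x0=0):
    """Estimate (4.7): average f over N simulated steps X_1, ..., X_N."""
    rng = np.random.default_rng(seed)
    cdf = np.cumsum(P, axis=1)            # row i holds the CDF of p(.|i)
    u = rng.random(N)                     # one uniform per transition
    x, total = x0, 0.0
    for n in range(N):
        x = min(int(np.searchsorted(cdf[x], u[n], side="right")), len(f) - 1)
        total += f[x]
    return total / N

# Illustrative (assumed) chain and function.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
f = np.array([1.0, 2.0, 4.0])

eta = stationary_distribution(P)
beta = float(eta @ f)                     # exact stationary average
estimate = sample_average(P, f, N=20000)  # Monte Carlo estimate of beta
print(beta, estimate)
```

The estimate fluctuates around `beta` with an error that shrinks only like $N^{-1/2}$, which is the slow convergence discussed next.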

The problem in most applications is that this convergence is very slow. 
This typically happens because S is very large and the chain makes only 
local moves, moving from any $i \in S$ to one of its 'neighbors'. Thus the
very aspect that makes it easy to simulate works against fast convergence.
For example, one may have $S = \{1,2,\ldots,M\}^d$ for some $M, d \gg 1$ and
the chain may move from any point therein only to points which differ in 
at most one of the d components by at most 1. In addition the state space 
may be ‘nearly decoupled’ into ‘almost invariant’ subsets such that the 
transitions between them are rare, and thus the chain spends a long time 
in whichever of these sets it finds itself in. As a result, the chain is very slow 
in scanning the entire state space, leading to slow convergence of the above
estimate. Also, the variance of this estimate can be quite high. There are 
many classical techniques for improving the performance of Monte Carlo, 
such as importance sampling, use of control variates, antithetic variates, 
stratified sampling, etc. See [3] for an extensive and up to date account of 
these. 

An alternative linear algebraic view of the problem would be to look at 
the associated Poisson equation: 


\[ V(i) = f(i) - \beta' + \sum_j p(j|i)V(j), \quad i \in S. \tag{4.8} \]


This is an equation in the pair $(V(\cdot), \beta')$. Under our hypotheses, $\beta'$ is
uniquely characterized as $\beta' = \beta$ and $V(\cdot)$ is unique up to an additive scalar.
As is clear from (4.8), this is the best one can expect, since the equation is
unaltered if one adds a scalar to all components of $V$. We shall denote by
$H$ the set of $V$ such that $(V, \beta)$ satisfies (4.8). Thus
$H = \{V : V = V^* + ce,\ c \in \mathbb{R}\}$ where $(V^*, \beta)$ is any fixed solution to (4.8),
and $e \stackrel{\mathrm{def}}{=}$ the vector of all 1's. The relative value iteration algorithm
for solving (4.8) is given by the iteration


\[ V_{n+1}(i) = f(i) - V_n(i_0) + \sum_j p(j|i)V_n(j), \quad i \in S, \tag{4.9} \]


where $i_0 \in S$ is a fixed state.
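
The iteration (4.9) can be sketched in a few lines. The chain and `f` below are illustrative assumptions (an aperiodic 3-state example whose stationary average is 2.25); the function name and iteration count are likewise hypothetical choices.

```python
import numpy as np

def relative_value_iteration(P, f, i0=0, iters=500):
    """Iterate (4.9): V <- f - V(i0) e + P V, starting from V = 0."""
    V = np.zeros(len(f))
    for _ in range(iters):
        V = f - V[i0] + P @ V
    return V

# Illustrative (assumed) chain; P[i, j] = p(j|i).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
f = np.array([1.0, 2.0, 4.0])

V = relative_value_iteration(P, f)
beta_est = V[0]                           # V_n(i0) approximates beta
residual = V - (f - beta_est + P @ V)     # Poisson equation (4.8) residual
print(beta_est, residual)
```

At convergence the pair `(V, V[i0])` solves (4.8), so the offset component itself carries the stationary average.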


Remarks: This is a special case of the more general relative value
iteration algorithm for solving the (nonlinear) dynamic programming
equation associated with the average cost control of a controlled Markov chain
([7], section 4.3). The convergence results for the latter specialized to the
present case lead to the conclusions $V_n \to V$ and $V_n(i_0) \to \beta$, where $(V, \beta)$
is the unique solution to (4.8) corresponding to $V(i_0) = \beta$. The choice
of the 'offset' $V_n(i_0)$ above is not unique: one can replace it by $g(V_n)$ for
any $g$ satisfying $g(e) = 1$, $g(x + ce) = g(x) + c$ for all $c \in \mathbb{R}$ (see [1],
p. 684). This choice will dictate which solution of (4.8) gets picked in the
limit. Nevertheless, in all cases, $g(V_n) \to \beta$, which is the quantity of
interest. In what follows, the analysis will have to be suitably modified in a few
places for choices of $g$ other than the specific one considered here.

To obtain a reinforcement learning variant of (4.9), one follows a
standard schematic. Suppose that at time $n$, $X_n = i$. The first step is to
replace the conditional expectation on the right hand side of (4.9) by an
actual evaluation at the observed next state, i.e., by


\[ f(i) - V_n(i_0) + V_n(X_{n+1}). \]


The second step is to make an incremental correction towards this quantity,
i.e., to replace $V_n(i)$ by a convex combination of itself and the above, with
a small weight $a(n)$ for the latter. The remaining components $V_n(j)$, $j \neq i$,
remain unaltered. Thus the update is


\[ \begin{aligned}
V_{n+1}(i) &= (1 - a(n)I\{X_n = i\})V_n(i) + a(n)I\{X_n = i\}[f(i) - V_n(i_0) + V_n(X_{n+1})] \\
&= V_n(i) + a(n)I\{X_n = i\}[f(i) - V_n(i_0) + V_n(X_{n+1}) - V_n(i)]. 
\end{aligned} \tag{4.10} \]


This can be rewritten as

\[ V_{n+1}(i) = V_n(i) + a(n)I\{X_n = i\}[T_i(V_n) - V_n(i_0) + M_{n+1}(i) - V_n(i)], \]

where $T(\cdot) = [T_1(\cdot),\ldots,T_s(\cdot)]^T$ is given by


\[ T_k(x) \stackrel{\mathrm{def}}{=} f(k) + \sum_j p(j|k)x_j \]


for $x = [x_1,\ldots,x_s]^T \in \mathbb{R}^s$, and

\[ M_{n+1}(j) \stackrel{\mathrm{def}}{=} f(j) + V_n(X_{n+1}) - T_j(V_n), \quad n \geq 0,\ 1 \leq j \leq s, \]

is a martingale difference sequence w.r.t. $\mathcal{F}_n \stackrel{\mathrm{def}}{=} \sigma(X_m, m \leq n)$. Let $D$
denote the diagonal matrix whose $i$th diagonal element is $\eta(i)$ for $i \in S$.
Then the counterpart of (4.3) for this case is


\[ \begin{aligned}
\dot x(t) &= D(T(x(t)) - x_{i_0}(t)e - x(t)) \\
&= \tilde T(x(t)) - x_{i_0}(t)\eta - x(t), 
\end{aligned} \tag{4.11} \]


where $\tilde T(x) \stackrel{\mathrm{def}}{=} (I - D)x + DT(x)$, $x \in \mathbb{R}^s$. We shall analyze (4.11) by
relating it to a secondary o.d.e.


\[ \dot{\tilde x}(t) = \tilde T(\tilde x(t)) - \beta\eta - \tilde x(t). \tag{4.12} \]

It is easily verified that the map $\hat T$ defined by $\hat T(x) = \tilde T(x) - \beta\eta$ is
nonexpansive w.r.t. the max-norm $\|x\|_\infty \stackrel{\mathrm{def}}{=} \max_i |x_i|$, and has $H$ as its set
of fixed points. Thus by the results of [14], $\tilde x(t)$ converges to some point
in $H$, depending on $\tilde x(0)$, as $t \uparrow \infty$. One can then mimic the arguments of
Lemma 3.1 through Theorem 3.4 of [1] step for step to claim successively that:

• If $x(0) = \tilde x(0)$, then $x(t) - \tilde x(t)$ is of the form $r(t)\eta$ for some $r(t)$ that
converges to a constant as $t \uparrow \infty$. (The only changes required in the
arguments of [1] are that, (i) in the proof of Lemma 3.3 therein, use the
weighted span seminorm $\|x\|_{\eta,s} \stackrel{\mathrm{def}}{=} \max_i (x_i/\eta(i)) - \min_i (x_i/\eta(i))$ instead of
the span seminorm $\|x\|_s \stackrel{\mathrm{def}}{=} \max_i x_i - \min_i x_i$, and, (ii) $r(\cdot)$ satisfies a
different convergent o.d.e., viz.,


\[ \dot r(t) = -\eta(i_0)r(t) + (\beta - \tilde x_{i_0}(t)), \]


than the one used in [1]^a.)

• $x(t)$ therefore converges to an equilibrium point of (4.11), which is seen
to be unique and equal to the unique element $V^*$ of $H$ characterized
by $V^*(i_0) = \beta$. In fact, $V^*$ is the unique globally asymptotically stable
equilibrium for (4.11).


Stochastic approximation theory then guarantees that $V_n \to V^*$ a.s. if we
establish (4.4), i.e., that $\sup_n \|V_n\| < \infty$ a.s. For this purpose, note that
the corresponding o.d.e. (4.6) is simply (4.11) with $f(\cdot) \equiv 0$, for which
an analysis similar to that for (4.11) shows that the origin is the
globally asymptotically stable equilibrium. The stability test of [12] mentioned
above then ensures the desired result.
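
A simulation sketch of the update (4.10) may help fix ideas. Everything specific here is an assumption made for illustration: the 3-state chain, the function `f`, the step size $a(n) = (n+1)^{-0.7}$, the seed and the run length. Only the component indexed by the current state is touched at each step.

```python
import numpy as np

def rl_relative_value(P, f, n_steps, seed=0, i0=0):
    """Run (4.10) along one simulated trajectory and return the final V_n."""
    rng = np.random.default_rng(seed)
    s = len(f)
    cdf = np.cumsum(P, axis=1)            # row-wise CDFs used to sample X_{n+1}
    u = rng.random(n_steps)
    V = np.zeros(s)
    x = 0
    for n in range(n_steps):
        x_next = min(int(np.searchsorted(cdf[x], u[n], side="right")), s - 1)
        a = (n + 1) ** -0.7               # assumed step size a(n)
        # incremental correction of the X_n-th component only
        V[x] += a * (f[x] - V[i0] + V[x_next] - V[x])
        x = x_next
    return V

# Illustrative (assumed) chain with stationary average beta = 2.25.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
f = np.array([1.0, 2.0, 4.0])

V = rl_relative_value(P, f, n_steps=100000)
print(V[0])                               # noisy estimate of beta
```

No transition probabilities enter the update itself; `P` is used only as a simulator, in line with the remark that a simulation device may be all that is available.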

The scheme (4.10) combines the deterministic numerical method (4.9)
with a Monte Carlo simulation and stochastic approximation to exploit the
averaging effect of the latter. Note, however, that unlike pure Monte Carlo,
it does conditional averaging instead of averaging. The determination of
the desired stationary average from this is then a consequence of an
algebraic relationship between the two specified by (4.8). The net gain is that
the part stochastic, part algebraic scheme has lower variance than pure
Monte Carlo because of the simpler conditional expectations involved, and
therefore more graceful convergence. But it has higher per iterate
computation. On the other hand, it has higher variance than the deterministic
relative value iteration (the latter has zero variance!). But then it has lower
per iterate computation because it does only local updates and does away
with the conditional expectation operation. (It is worth noting that in
some applications, this is not even an option because the conditional
probabilities are not explicitly available, only a simulation/experimental device
that conforms to them is.) In this sense, the reinforcement learning
algorithm (4.10) captures a trade-off between pure numerical methods and pure
Monte Carlo.

^a Here $r(\cdot)\eta$ is seen to satisfy the o.d.e.
$\dot r(t)\eta = (D(P - I) - \eta(i_0)I)r(t)\eta + (\beta - \tilde x_{i_0}(t))\eta$.
Left-multiply both sides by $e^T$.

In the form stated above, however, the scheme also inherits one major 
drawback each from numerical methods and Monte Carlo when s is very 
large. From the former, it inherits the notorious ‘curse of dimensionality’. 
From the latter it inherits slow convergence due to slow approach to sta- 
tionarity already mentioned earlier, a consequence of the local movement 
of the underlying Markov chain and possible occurrence of ‘quasi-invariant’ 
subsets of the state space. This motivates two important modifications of 
the basic scheme. The first is aimed at countering the curse of dimension- 
ality and uses the notion of (linear) function approximation. That is, we 
approximate V* by a linear combination of a moderate number of basis 
functions (or features in the parlance of artificial intelligence, these can 
also be thought of as approximate sufficient statistics — see [8]). These are 
kept fixed and their weights are learnt through a learning scheme akin to 
the above, but lower dimensional than the original. We outline this in the 
next section. The second problem, that of slow mixing, can be alleviated 
by using the devices of conditional importance sampling from [2] or split 
sampling from [13], either separately or together. This will be described in 
section 4.6. 

Before doing so, however, we describe an important related
development. Recall the notion of control variates ([3]). These are zero mean
random variables $\{\xi_n\}$ incorporated into the Monte Carlo scheme such that if
we evaluate


\[ \frac{1}{N}\sum_{m=1}^{N} (f(X_m) + \xi_m) \tag{4.13} \]


instead of (4.7), it has lower variance but the same asymptotic limit as
(4.7). Consider the choice


\[ \xi_n = \sum_j p(j|X_n)V^*(j) - V^*(X_n), \quad n \geq 0, \]


where $(V^*, \beta)$ satisfy (4.8). Then $f(X_n) + \xi_n = \beta\ \forall n$ and the variance is
in fact zero! The catch, of course, is that we do not know $V^*$. If, however,
we have a good approximation thereof, these would serve as good control
variates. This idea is pursued in [24], where the '$V^*$' for a limiting
deterministic problem, the so called fluid limit, is used for arriving at a good
choice of control variates (see also [31], Chapter 11). The scheme described
above could also be thought of as one that adaptively generates control
variates.
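
The zero-variance claim can be checked numerically on a small example. The sketch below, under the assumption of an illustrative 3-state chain and `f`, solves the Poisson equation (4.8) by least squares and verifies that $f(i) + \xi(i)$ is the same constant at every state $i$.

```python
import numpy as np

# Illustrative (assumed) chain; P[i, j] = p(j|i).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
f = np.array([1.0, 2.0, 4.0])
s = len(f)

# Stationary distribution: eta^T P = eta^T, sum(eta) = 1.
A = np.vstack([P.T - np.eye(s), np.ones((1, s))])
b = np.append(np.zeros(s), 1.0)
eta = np.linalg.lstsq(A, b, rcond=None)[0]
beta = float(eta @ f)

# One solution V* of (4.8), i.e. (I - P) V = f - beta e (consistent but singular).
V_star = np.linalg.lstsq(np.eye(s) - P, f - beta, rcond=None)[0]

# Control variate xi(i) = sum_j p(j|i) V*(j) - V*(i), evaluated at each state i.
xi = P @ V_star - V_star
corrected = f + xi                        # equals beta at every state
print(corrected)
```

Replacing `V_star` by an approximation would make `corrected` only approximately constant, which is the practical situation described above.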


4.4. Function Approximation 


We now describe an approximation to the aforementioned scheme that tries 
to beat down the curse of dimensionality at the expense of an additional 
layer of approximation and the ensuing approximation error. Function ap- 
proximation based reinforcement learning schemes for precisely this prob- 
lem have been proposed and analyzed in [38] and [39]. The one sketched 
below, though similar in spirit, is new and appears here for the first time. 

The essential core of the scheme is the approximation
$V(i) \approx \sum_{j=1}^{M} r_j\phi_j(i)$, $i \in S$, where $M$ is significantly smaller than $s = |S|$,
$\phi_j : S \to \mathbb{R}$ are basis functions prescribed a priori and $\{r_j\}$ are weights
that need to be estimated. For a function $g : S \to \mathbb{R}$, we shall use $g$ to
denote both the function itself and the vector $[g(1),\ldots,g(s)]^T$, depending
on the context. For the specific problem under consideration, we impose
the special requirement that $\phi_1 = f$, $\phi_2 = e$. Let
$\Phi \stackrel{\mathrm{def}}{=} [[\phi_{ij}]]_{1 \leq i \leq s, 1 \leq j \leq M}$,
where $\phi_{ij} = \phi_j(i)$ and $\phi(i) = [\phi_{i1},\ldots,\phi_{iM}]^T$. Thus $f = \Phi u_1$, $e = \Phi u_2$,
where $u_i$ is the unit vector in the $i$th direction. Consider the 'approximate
Poisson equation'


\[ \Phi r = P\Phi r - \beta e + f. \]


This is obtained by replacing $V$ by $\Phi r$ in (4.8). (This is a purely formal
substitution, as clearly this equation will not have a solution unless
$V = \Phi r$ + a scalar multiple of $e$.) Left-multiplying this
equation on both sides first by $\Phi^T$ and then by $(\Phi^T\Phi)^{-1}$, we get

\[ r = (\Phi^T\Phi)^{-1}\Phi^T(P\Phi r - \beta e + f). \]

Note that $(y = \Phi r, \beta)$ then satisfies the 'projected Poisson equation'

\[ y = \Pi Py - \beta e + f, \tag{4.14} \]

where $\Pi \stackrel{\mathrm{def}}{=} \Phi(\Phi^T\Phi)^{-1}\Phi^T$ is the projection onto the range of $\Phi$. (In
particular, $\Pi$ leaves $e, f$ invariant, a fact used here.) This is our initial
approximation of the original problem, which will be modified further in what
follows. By analogy with the preceding section, this suggests the algorithm


\[ r_{n+1} = r_n + a(n)\,[B_n^{-1}\phi(X_n)(\phi^T(X_{n+1})r_n - \phi^T(i_0)r_n + f(X_n)) - r_n], \tag{4.15} \]

where


\[ B_n \stackrel{\mathrm{def}}{=} \frac{1}{n+1}\sum_{m=0}^{n} \phi(X_m)\phi^T(X_m). \]


Also, $i_0$ is a fixed state in $\cup_k\{j : \phi_k(j) > 0\}$. Note that $\{B_n^{-1}\}$ can be
computed iteratively by: $B_n^{-1} = (n + 1)\bar B_n^{-1}$, where $\bar B_n = (n+1)B_n$ is the
unnormalized sum and


\[ \bar B_{n+1}^{-1} = \bar B_n^{-1} - \frac{\bar B_n^{-1}\phi(X_{n+1})\phi^T(X_{n+1})\bar B_n^{-1}}{1 + \phi^T(X_{n+1})\bar B_n^{-1}\phi(X_{n+1})}. \]


This follows from the Sherman-Morrison-Woodbury formula ([22], p. 50).
If $B_n$ is not invertible in early iterations, one may add to it $\delta$ times the
identity matrix for a small $\delta > 0$, or use pseudo-inverse in place of inverse.
The error this causes will be asymptotically negligible because of the time
averaging. By ergodicity of $\{X_n\}$, $B_n \to \Phi^T D\Phi$ a.s.
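
The rank-one inverse update can be sketched as follows. The feature vectors are random stand-ins for $\phi(X_m)$, and the regularization $\delta I$ is the device mentioned above for early non-invertibility; the dimension, sample count and value of `delta` are assumptions made for the example.

```python
import numpy as np

def sherman_morrison_update(Binv_bar, phi):
    """Inverse of (Bbar + phi phi^T), given the symmetric inverse of Bbar."""
    v = Binv_bar @ phi
    return Binv_bar - np.outer(v, v) / (1.0 + phi @ v)

rng = np.random.default_rng(0)
M, delta = 3, 1e-3
features = rng.random((50, M))            # stand-ins for phi(X_0), ..., phi(X_49)

Bbar = delta * np.eye(M)                  # unnormalized sum, regularized by delta * I
Binv_bar = np.eye(M) / delta              # its inverse, maintained recursively
for phi in features:
    Bbar += np.outer(phi, phi)
    Binv_bar = sherman_morrison_update(Binv_bar, phi)

# Inverse of the time average B_n = Bbar / (n + 1), with n + 1 = 50 samples.
Bn_inv = len(features) * Binv_bar
print(np.max(np.abs(Binv_bar - np.linalg.inv(Bbar))))
```

Each update costs $O(M^2)$ operations instead of the $O(M^3)$ of a fresh inversion, which matters when (4.15) is run for many steps.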

The limiting o.d.e. can be written by inspection as


\[ \dot r_t = (\Phi^T D\Phi)^{-1}\Phi^T D(P\Phi r_t - (\phi^T(i_0)r_t)e + f) - r_t. \]


Let $\tilde\Pi \stackrel{\mathrm{def}}{=} \Phi(\Phi^T D\Phi)^{-1}\Phi^T D$. Then $e$ is invariant under $\tilde\Pi$, i.e., it is
an eigenvector of $\tilde\Pi$ corresponding to the eigenvalue 1. To see this, note
that $\sqrt{D}\,\tilde\Pi(\sqrt{D})^{-1}$ is the projection operator onto Range$(\sqrt{D}\Phi)$, whence
$\sqrt{D}e = \sqrt{D}\Phi u_2$ is invariant under it. This is equivalent to the statement
that $e$ is invariant under $\tilde\Pi$. A similar argument shows that $f$ is invariant
under $\tilde\Pi$. Let $y(t) \stackrel{\mathrm{def}}{=} \Phi r_t$. Then the above o.d.e. leads to

\[ \dot y(t) = \tilde\Pi Py(t) - y_{i_0}(t)e + f - y(t). \]
The following easily proven facts are from [38]:


1 $\tilde\Pi$ is the projection to Range$(\Phi)$ w.r.t. the weighted norm $\|\cdot\|_D$ defined
by $\|x\|_D \stackrel{\mathrm{def}}{=} (\sum_i \eta(i)x_i^2)^{1/2}$.
2 $P$ is nonexpansive w.r.t. $\|\cdot\|_D$: $\|Px\|_D \leq \|x\|_D$.


Note that $P$ is nonexpansive w.r.t. the norm $\|x\|_D$, hence so is $\tilde\Pi P$. Also,
$e$ is its unique eigenvector corresponding to eigenvalue 1 and the
remaining eigenvalues are in the interior of the unit circle. Thus
$\tilde\Pi P - eu_{i_0}^T = \tilde\Pi(P - eu_{i_0}^T)$ has eigenvalues strictly inside the unit disc of the
complex plane. (Recall that $u_j$ for each $j$ denotes the unit vector in $j$th coordinate
direction.) Finally, $\tilde\Pi(P - eu_{i_0}^T) - I$ then has eigenvalues with strictly
negative real parts. These considerations lead to:


• the map $y \mapsto \tilde\Pi(P - eu_{i_0}^T)y + f$ is a $\|\cdot\|_D$-contraction and has a unique
fixed point $y^*$, and,
• the above o.d.e. is a stable linear system and converges to $y^*$.

Thus $r_t \to r^* \stackrel{\mathrm{def}}{=} (\Phi^T\Phi)^{-1}\Phi^T y^*$. Just as in the preceding section, an
identical analysis of the o.d.e. with $f$ replaced by the zero vector leads to
a.s. boundedness of the iterates, leading to $r_n \to r^*$ a.s. The quantity
$\phi^T(i_0)r^*$ then serves as our estimate of $\beta$. Note that we have solved the
approximate Poisson equation


\[ y = \tilde\Pi Py - y_{i_0}e + f, \]


in which the weighted projection $\tilde\Pi = \Phi(\Phi^T D\Phi)^{-1}\Phi^T D$ takes the place of
$\Pi$, and not (4.14). This change is caused by the specific sampling scheme we
used.
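
The limit point of the scheme can be computed directly on a small example by solving the fixed-point equation $y = \tilde\Pi(P - eu_{i_0}^T)y + f$ as a linear system. The sketch below is an illustration under assumed data: a 4-state chain, $f$, and the minimal basis $\phi_1 = f$, $\phi_2 = e$ (so $M = 2 < s = 4$); the function names are hypothetical.

```python
import numpy as np

def weighted_projection(Phi, eta):
    """D-weighted projection Phi (Phi^T D Phi)^{-1} Phi^T D onto Range(Phi)."""
    D = np.diag(eta)
    return Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

def projected_fixed_point(P, f, Phi, eta, i0=0):
    """Solve y = Pi (P - e u_{i0}^T) y + f by a direct linear solve."""
    s = len(f)
    Pi = weighted_projection(Phi, eta)
    u = np.zeros(s)
    u[i0] = 1.0
    A = Pi @ (P - np.outer(np.ones(s), u))
    return np.linalg.solve(np.eye(s) - A, f)

# Illustrative (assumed) chain, its stationary law, and f.
P = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.50, 0.50]])
eta = np.array([1.0, 2.0, 2.0, 1.0]) / 6.0
f = np.array([1.0, 2.0, 4.0, 3.0])
beta = float(eta @ f)

Phi = np.column_stack([f, np.ones(4)])    # columns phi_1 = f, phi_2 = e
y_star = projected_fixed_point(P, f, Phi, eta)
r_star = np.linalg.lstsq(Phi, y_star, rcond=None)[0]   # (Phi^T Phi)^{-1} Phi^T y*
print(y_star[0], beta)                    # phi^T(i0) r* versus beta
```

In this small example the estimate `y_star[0]` agrees with `beta` up to rounding, while `y_star` itself is only an approximation of the Poisson solution within the two-dimensional range of `Phi`.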


4.5. Estimating Stationary Distribution 


Here we consider the problem of estimating the stationary distribution of an
irreducible finite state Markov chain. It is the right Perron-Frobenius
eigenvector of its transposed transition matrix, corresponding to the
Perron-Frobenius eigenvalue 1. More generally, one can consider the problem of
estimating the Perron-Frobenius eigenvector (say, $q$) of an irreducible
nonnegative matrix $Q = [[q_{ij}]]$, corresponding to the Perron-Frobenius
eigenvalue $\lambda^*$. Thus we have $Qq = \lambda^* q$. One standard numerical method for
this is the 'power method':


\[ q_{n+1} = \frac{Qq_n}{\|Qq_n\|}, \]


beginning with some initial guess $q_0$ with strictly positive components. The
normalization on the right hand side is somewhat flexible, e.g., one may use

\[ q_{n+1} = \frac{Qq_n}{q_n(i_0)}, \quad n \geq 0,\ i_0 \in S \text{ fixed}, \]


with $q_0(i_0) > 0$. The following scheme, a special case of the learning
algorithm studied in [11], may be considered a stochastic approximation version
of this. It can also be viewed as a multiplicative analog of (4.9). We shall
consider the unique choice of $q$ for which $q(i_0) = \lambda^*$. Write $Q = LP$ where
$L$ is a diagonal matrix with $i$th diagonal element $\ell(i) = \sum_j q_{ij}$ and $P$ is
an irreducible stochastic matrix. Then


\[ q_{n+1}(j) = q_n(j) + a(\nu(j,n))I\{X_n = j\}\left[\frac{\ell(X_n)q_n(X_{n+1})}{q_n(i_0)} - q_n(j)\right], \quad j \in S, \]


with $q_0$ in the positive orthant and $\{a(n)\}$ satisfying the additional
conditions stipulated in [10]. Under suitable conditions on the step sizes (see the
discussion at the end of section 4.2 above), the limiting o.d.e. is


\[ \dot q_t = \frac{LPq_t}{q_t(i_0)} - q_t = \frac{Qq_t}{q_t(i_0)} - q_t. \tag{4.16} \]


This can be analyzed by first analyzing a 'secondary o.d.e.'


\[ \dot{\tilde q}_t = \frac{Q\tilde q_t}{\lambda^*} - \tilde q_t. \tag{4.17} \]


Under our hypotheses, $Q/\lambda^* - I$ has a unique normalized eigenvector $q^*$
corresponding to the eigenvalue zero, viz., the normalized Perron-Frobenius
eigenvector of $Q$. Then $aq^*$ is an equilibrium point for this linear system
for all $a \in \mathbb{R}$. All other eigenvalues are in the left half of the complex plane.
Thus $\tilde q_t \to a^*q^*$ for an $a^* \in \mathbb{R}$ that depends on the initial condition. In
turn, one can show that for the same initial condition, $q_t = \psi(t)\tilde q_{\tau(t)}$, where


\[ \tau(t) = \int_0^t \frac{\lambda^*}{q_s(i_0)}\,ds, \qquad \psi(t) = \exp\left(\int_0^t \left(\frac{\lambda^*}{q_s(i_0)} - 1\right)ds\right). \]


This is verified by first checking that $\psi(\cdot)\tilde q_{\tau(\cdot)}$ does satisfy (4.16) and then
invoking the uniqueness of the solution to the latter. Now argue as in
Lemma 4.1 of [11] to conclude that for $q_0 \in$ the positive orthant, $q_t \to q$, a
scalar multiple of $q^*$ corresponding to $q(i_0) = \lambda^*$.
Letting

\[ h(q) \stackrel{\mathrm{def}}{=} \frac{Qq}{q_{i_0}} - q, \]

one has $h_\infty(q) = -q$, where the l.h.s. is defined as in (4.5). By the stability
test of [12], we then have $\sup_n \|q_n\| < \infty$ a.s. In view of the foregoing, this
leads to $q_n \to q$ a.s.
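
The deterministic counterpart of this scheme, the power method with the normalization $q_n(i_0)$, is easy to try out. The sketch below uses an assumed strictly positive $3 \times 3$ matrix `Q` and also forms the factorization $Q = LP$ described above; names and the iteration count are illustrative choices.

```python
import numpy as np

def power_method_i0(Q, i0=0, iters=200):
    """Iterate q <- Q q / q(i0) from the all-ones initial guess."""
    q = np.ones(Q.shape[0])
    for _ in range(iters):
        q = Q @ q / q[i0]
    return q

# Illustrative (assumed) positive matrix.
Q = np.array([[2.0, 1.0, 0.5],
              [1.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])

q = power_method_i0(Q)
lam_star = float(np.max(np.linalg.eigvals(Q).real))   # Perron-Frobenius eigenvalue

# Factorization Q = L P: ell(i) is the i-th row sum, P is a stochastic matrix.
ell = Q.sum(axis=1)
P = Q / ell[:, None]
print(q[0], lam_star)
```

At the fixed point $Qq = q(i_0)q$, so the normalizing component `q[i0]` converges to $\lambda^*$ while `q` converges to the eigenvector singled out by $q(i_0) = \lambda^*$.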

Once again, in view of the curse of dimensionality, we need to add
another layer of approximation via the function approximation $q \approx \Phi r$, where
$\Phi, r$ are analogous to the preceding section. The function approximation
variant of the above is


\[ r_{n+1} = r_n + a(n)\left[\left(B_n^{-1}\frac{\ell(X_n)\phi(X_n)\phi^T(X_{n+1})}{(\phi^T(i_0)r_n) \vee \epsilon} - I\right)r_n\right], \tag{4.18} \]


where $B_n$ is as before and $\epsilon > 0$ is a small prescribed scalar introduced in
order to avoid division by zero. We need the following additional restriction
on the basis functions:


(+) $\phi_1,\ldots,\phi_M$ are orthogonal vectors in the positive cone of $\mathbb{R}^s$, and the
submatrix of $P$ corresponding to $\cup_k\{i : \phi_k(i) > 0\}$ is irreducible.


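This scheme is a variant of the scheme proposed and analyzed in [5]. (It
has one less explicit averaging operation. As a rule of thumb, any additional
averaging makes the convergence more graceful at the expense of its speed.)
Let $A \stackrel{\mathrm{def}}{=} \Phi^T DQ\Phi$, $B \stackrel{\mathrm{def}}{=} \Phi^T D\Phi$. The limiting o.d.e. is the same as that
of [5], viz.,


\[ \dot r(t) = \left(\frac{B^{-1}A}{(\phi^T(i_0)r(t)) \vee \epsilon} - I\right)r(t). \tag{4.19} \]


For $\epsilon = 0$, this reduces to

\[ \dot r(t) = \left(\frac{B^{-1}A}{\phi^T(i_0)r(t)} - I\right)r(t). \tag{4.20} \]


Let $W \stackrel{\mathrm{def}}{=} \sqrt{D}\Phi$ and $M \stackrel{\mathrm{def}}{=} \sqrt{D}Q(\sqrt{D})^{-1}$. Then $B = W^TW$. Let
$\Pi \stackrel{\mathrm{def}}{=} W(W^TW)^{-1}W^T$. Setting $Y(t) = Wr(t)$, (4.20) may be rewritten as


\[ \dot Y(t) = \left(\frac{\Pi M}{\phi^T(i_0)r(t)} - I\right)Y(t) = \left(\frac{\Pi M}{Y_{i_0}(t)/\sqrt{\eta(i_0)}} - I\right)Y(t). \tag{4.21} \]


Under (+), $\Pi M$ can be shown to be a nonnegative matrix ([5]). Let $\gamma^*$
denote the Perron-Frobenius eigenvalue of $\Pi M$. Consider the associated
secondary o.d.e.


\[ \dot{\tilde Y}(t) = \left(\frac{\Pi M}{\gamma^*} - I\right)\tilde Y(t). \tag{4.22} \]
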
Let $z^*$ be the Perron-Frobenius eigenvector of $\Pi M$ uniquely specified by
the condition $z^*(i_0) = \sqrt{\eta(i_0)}\,\gamma^*$. The convergence of (4.22), and therefore
that of (4.21), to $z^*$ is established by arguments analogous to those for
(4.16), (4.17) (see [5] for details). Now let $\epsilon > 0$. One can use Theorem
1, p. 339, of [25] to conclude that given any $\delta > 0$, there exists an $\epsilon^* > 0$
small enough such that for $\epsilon \in (0, \epsilon^*)$, the corresponding $Y(t)$ converges
to the $\delta$-neighborhood of $z^*$. But once in a small neighborhood of $z^*$,
$Y_{i_0}(t) \approx \sqrt{\eta(i_0)}\,\gamma^* > 0$. Hence, if $\epsilon \ll \gamma^*$, (4.19) reduces to (4.20) in this
neighborhood and therefore $Y(t) \to z^*$. In turn, one then has
$r_n \to r^* \stackrel{\mathrm{def}}{=} (W^TW)^{-1}W^Tz^*$ a.s. by familiar arguments. Once again, we have solved
an approximation to the original eigenvalue problem, viz., the eigenvalue
problem for $\Pi M$. $\Phi r^*$, resp. $\gamma^*$ are then our approximations to $q$, $\lambda^*$.

Recalling our original motivation of estimating stationary distribution of 
a Markov chain, note that Q in fact will be the transpose of its transition 
matrix and $\lambda^*$ is a priori known to be one. Thus we may replace the
denominator $q_n(i_0)$ in the first term on the right hand side of the algorithm
by 1. This is a linear iteration for which the limiting o.d.e. is in fact the
secondary o.d.e. (4.17) with $\lambda^* = 1$. This is not, however, possible for
(4.21), as $\gamma^*$ may not be 1.

We conclude this section by outlining a related problem. Consider the 
case when Q is an irreducible nonnegative and positive definite symmetric 
matrix. The problem once again is to find the Perron-Frobenius eigenvector
of $Q$. This can be handled exactly as above, except that the
Perron-Frobenius eigenvalue need not be one. An important special case is when
$Q = A^TA$ where $A$ is the adjacency matrix of an irreducible graph. This
corresponds to Kleinberg’s HITS scheme for web page ranking, an alter- 
native to the PageRank mentioned above ([27]). The Perron-Frobenius 
eigenvector then corresponds to the unique stationary distribution for a 
vertex-reinforced random walk on S with reinforcement matrix Q ([33]). 


4.6. Acceleration Techniques 


For reasons mentioned earlier, the convergence of these algorithms can be 
slow even after dimensionality reduction, purely due to the structure of the 
underlying Markov chain. We shall discuss two ideas that can be used to 
advantage for speeding things up. The first is that of conditional impor- 
tance sampling. This was introduced in [2] in the context of rare event 
simulation and has also been subsequently used in [39]. Recall that
importance sampling for evaluating a stationary average $\sum_{i \in S} f(i)\eta(i)$ amounts
to replacing (4.7) by


\[ \frac{1}{N}\sum_{m=1}^{N} f(\tilde X_m)\Lambda_m, \tag{4.23} \]


where:


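• $\{\tilde X_n\}$ is another Markov chain on $S$ with the same initial law, with its
transition matrix $[[\tilde p(\cdot|\cdot)]]_{i,j \in S}$ satisfying the condition:


\[ p(j|i) > 0 \Longleftrightarrow \tilde p(j|i) > 0\ \forall\, i,j, \tag{4.24} \]


and,


• $\Lambda_n \stackrel{\mathrm{def}}{=} \prod_{m=0}^{n-1} \frac{p(\tilde X_{m+1}|\tilde X_m)}{\tilde p(\tilde X_{m+1}|\tilde X_m)}$
is the likelihood ratio at time $n$.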
This is the simplest version; other more sophisticated variations are possible
([3]). Clearly (4.23) is asymptotically unbiased, i.e., its mean approaches
the desired mean $\sum_{i \in S} f(i)\eta(i)$ as $N \uparrow \infty$. The idea is to choose $\{\tilde X_n\}$
to accelerate the convergence. This, however, can lead to inflating the
variance, which needs to be carefully managed. See [3] for an account of
this broad area.

To motivate conditional importance sampling, note that our algorithms
are of the form


\[ x_{n+1}(i) = x_n(i) + a(n)I\{X_n = i\}[F_{X_{n+1}}(x_n) + g_i(x_n) + M_{n+1}(i) + \zeta_{n+1}(i)], \quad 1 \leq i \leq d, \tag{4.25} \]


where $d \geq 1$, $F = [F_1,\ldots,F_d]$, $g = [g_1,\ldots,g_d] : \mathbb{R}^d \to \mathbb{R}^d$ are
prescribed maps, $\{M_n\}$ is a martingale difference sequence as before, and
$\zeta_n = [\zeta_n(1),\ldots,\zeta_n(d)]$, $n \geq 1$, an asymptotically negligible 'error' sequence,
i.e., $\zeta_n \to 0$ a.s. The o.d.e. limit is


\[ \dot x(t) = D(PF(x(t)) + g(x(t))). \]

The idea here is to replace $\{X_n\}$ by $\{\tilde X_n\}$ as above and (4.25) by


\[ x_{n+1}(i) = x_n(i) + a(n)\left(\frac{p(\tilde X_{n+1}|\tilde X_n)}{\tilde p(\tilde X_{n+1}|\tilde X_n)}\right)I\{\tilde X_n = i\}[F_{\tilde X_{n+1}}(x_n) + g_i(x_n) + M_{n+1}(i) + \zeta_{n+1}(i)], \quad 1 \leq i \leq d. \tag{4.26} \]


This will have the o.d.e. limit


\[ \dot x(t) = \tilde D(PF(x(t)) + g(x(t))), \]

where $\tilde D$ is the diagonal matrix with $i$th diagonal element $= \tilde\eta(i) \stackrel{\mathrm{def}}{=}$ the
stationary probability of state $i$ for $\{\tilde X_n\}$. The advantages are:


1 $\tilde D$ can then be tailored to be a more suitable matrix (e.g., a scalar multiple
of identity) by an appropriate choice of $\tilde p(\cdot|\cdot)$. $\tilde D \neq D$ can, however, be
a problem when the convergence of the algorithm (at least theoretically)
critically depends on having $D$ in place, see, e.g., section 4.4 above.

2 As already noted, a major reason for slow mixing is the occurrence of
nearly invariant sets. For the reversible case, a neat theoretical basis
for this behaviour is available in terms of the spectrum of transition
matrices; see, e.g., [15]. To work around this, $\{\tilde X_n\}$ can be chosen to be
rapidly mixing, i.e., the laws of $\tilde X_n$ converge to the stationary law $\tilde\eta$ much
faster than the corresponding rate for convergence of laws of the original
chain $\{X_n\}$ to $\eta$. This can be achieved by increasing the probability of
links which connect the nearly invariant sets to each other, so that the
chain moves across their boundary more often. In addition, one may also
introduce new edges which enhance mixing. This means that we relax
(4.24) to


\[ p(j|i) > 0 \Longrightarrow \tilde p(j|i) > 0\ \forall\, i,j. \]


See, e.g., [23], which discusses such a scenario in the context of MCMC. In
particular, the new edges can be chosen to alleviate the problems caused
by local movement of the chain, by introducing the so called 'long jumps'.
There will, however, be tradeoffs involved. For example, too many such
transitions, which do not contribute anything to the learning process
explicitly (because the r.h.s. of (4.26) is then zero), will in fact slow it
down. Furthermore, there is often significantly higher computational cost
associated with simulating these 'long jumps'.

3 An important difference between this and the traditional importance 
sampling is that this involves only a one step likelihood ratio for sin- 
gle transitions, not a full likelihood ratio accumulated over time from a 
regeneration time or otherwise. This tremendously reduces the problem 
of high variance, at the cost of higher per iterate computation. 
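
The role of the one step likelihood ratio can be illustrated by checking that it restores the correct conditional mean. In the sketch below, the matrices `P` and `P_tilde` and the vector `F` (standing in for the values $F_j(x)$ at a fixed $x$) are assumptions made for the example; `P_tilde` adds two hypothetical 'long jump' edges that are absent in `P`.

```python
import numpy as np

# Original chain; P[i, j] = p(j|i).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
# Assumed sampling chain with extra edges 0 -> 2 and 2 -> 0.
P_tilde = np.array([[0.4, 0.4, 0.2],
                    [0.3, 0.4, 0.3],
                    [0.2, 0.4, 0.4]])
F = np.array([1.0, -2.0, 5.0])            # stand-in for F_j(x) at a fixed x

# One step likelihood ratio p(j|i) / p~(j|i); it vanishes on the added edges.
ratio = P / P_tilde

# Conditional mean under the sampling chain of ratio * F_{next state}:
# sum_j p~(j|i) * ratio(i, j) * F_j, for each current state i.
conditional_mean = (P_tilde * ratio) @ F
print(conditional_mean, P @ F)            # both equal (P F)(i)
```

A transition along an added edge has ratio zero and so contributes nothing to the update, which is the tradeoff mentioned in item 2 above.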


This scheme, however, does require that the transition probabilities are 
known in an explicit analytic form, or are easy to compute or estimate. This 




need not always be the case. Another scheme which does away with this 
need is the split sampling proposed in [13], which does so at the expense of 
essentially doubling the simulation budget. Here we generate at each time $n$
two $S$-valued random variables, $Y_n$ and $Z_n$, such that $\{Y_n\}$ is an irreducible
Markov chain on $S$ with transition matrix (say) $[[q(\cdot|\cdot)]]$ and stationary
distribution $\gamma$, and $Z_n$ is conditionally independent of $\{Z_m, Y_m; m < n\}$,
given $Y_n$, with conditional law $P(Z_n = j|Y_n = i) = p(j|i)\ \forall\, i,j$. Then the
scheme is


\[ x_{n+1}(i) = x_n(i) + a(n)I\{Y_n = i\}[F_{Z_n}(x_n) + g_i(x_n) + M_{n+1}(i) + \zeta_{n+1}(i)], \quad 1 \leq i \leq d. \tag{4.27} \]

The limiting o.d.e. is


\[ \dot x(t) = \hat D(PF(x(t)) + g(x(t))), \]

where $\hat D$ is a diagonal matrix whose $i$th diagonal element is $\gamma(i)$. In addition
to the advantage of not requiring the explicit knowledge of $p(\cdot|\cdot)$ (though we
do require a simulator that can simulate the transitions governed by $p(\cdot|\cdot)$),
this also decouples the issues of mixing and that of obtaining the correct
transition probabilities on the right hand side of the limiting o.d.e. We can
choose $\{Y_n\}$, e.g., to be a rapidly mixing Markov chain with uniform
stationary distribution (see, e.g., [32]), or even i.i.d. uniform random variables
when they are easy to generate (unfortunately this is not always an easy
task when the dimension is very high, as computer scientists well know).
This scheme does away with the need for conditional importance sampling
for speed-up, though one could also consider a combined scheme that
combines both, i.e., a scheme which sets $P(Z_n = j|Y_n = i) = \tilde p(j|i)\ \forall\, i,j$ in
addition to the above and replaces (4.27) by


\[ x_{n+1}(i) = x_n(i) + a(n)\left(\frac{p(Z_n|Y_n)}{\tilde p(Z_n|Y_n)}\right)I\{Y_n = i\}[F_{Z_n}(x_n) + g_i(x_n) + M_{n+1}(i) + \zeta_{n+1}(i)], \quad 1 \leq i \leq d. \tag{4.28} \]
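
As an illustration of split sampling (4.27), the following sketch applies it to the relative value scheme of section 4.3, i.e., with $F_j(x) = x_j$ and $g_i(x) = f(i) - x_{i_0} - x_i$. The choices made here are assumptions for the example only: $\{Y_n\}$ is taken i.i.d. uniform on $S$, the chain and `f` are small illustrative data, and the step size is $a(n) = (n+1)^{-0.7}$.

```python
import numpy as np

def split_sampling_rvi(P, f, n_steps, seed=0, i0=0):
    """Split sampling (4.27) for the relative value scheme, Y_n i.i.d. uniform."""
    rng = np.random.default_rng(seed)
    s = len(f)
    cdf = np.cumsum(P, axis=1)
    Y = rng.integers(s, size=n_steps)     # state whose component gets updated
    U = rng.random(n_steps)
    x = np.zeros(s)
    for n in range(n_steps):
        y = Y[n]
        # Z_n is drawn from p(.|Y_n), independently of the past given Y_n
        z = min(int(np.searchsorted(cdf[y], U[n], side="right")), s - 1)
        a = (n + 1) ** -0.7               # assumed step size a(n)
        x[y] += a * (f[y] - x[i0] + x[z] - x[y])
    return x

# Illustrative (assumed) chain with stationary average beta = 2.25.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
f = np.array([1.0, 2.0, 4.0])

x = split_sampling_rvi(P, f, n_steps=100000)
print(x[0])                               # noisy estimate of beta
```

Here the frequency of updates is governed by the law of $Y_n$ alone, while `P` enters only through the simulated transitions $Y_n \to Z_n$, which is the decoupling described above.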


4.7. Future Directions 


This line of research is quite young and naturally has ample research op- 
portunities. There are the usual issues that go with stochastic algorithms, 
such as convergence rates, sample complexity, etc. In addition, one im- 
portant theme not addressed in this article is that of obtaining good error 
bounds for function approximations. While some of the references cited
above do have something to say about this, it is still a rather open area. 
Its difficulty is further compounded by the fact that the approximation 
error will depend on the choice of basis functions and there are no clear 
guidelines for this choice, though some beginnings have been made. These 
include clustering techniques based on graph clustering ([29], [30]) and
random projections ([4]). Finally, the theoretical convergence proof of some
schemes crucially depends on the states being sampled according to the 
stationary distribution of the Markov chain, which is inconvenient for the 
acceleration techniques mentioned above. There is a need for making the 
schemes robust vis-a-vis the sampling strategy. 


5.1. Introduction


Let $G$ be a locally compact second countable group. We denote by $P(G)$ the
space of all probability measures on $G$ equipped with the weak*-topology
with respect to bounded continuous functions, viz. a sequence $\{\mu_i\}$ in
$P(G)$ converges to $\mu$ in $P(G)$, and we write $\mu_i \to \mu$, if $\int \varphi\,d\mu_i \to \int \varphi\,d\mu$ for all
bounded continuous functions $\varphi$. On $P(G)$ we have the convolution product
$*$ of measures making it a topological semigroup; namely $(\lambda, \mu) \mapsto \lambda * \mu$ is
a continuous map of $P(G) \times P(G)$ into $P(G)$.

For $g \in G$ let $\delta_g$ denote the point mass concentrated at $g$, namely the
probability measure such that $\delta_g(E) = 1$ for a Borel subset $E$ of $G$ if and
only if $g \in E$. It can be seen that $\{\delta_g \mid g \in G\}$ is a closed subset of $P(G)$ and
the map $g \mapsto \delta_g$ gives an embedding of $G$ as a closed subset of $P(G)$, which
is also a homomorphism of the semigroup $G$ into the semigroup $P(G)$.


Notation 5.1. In the sequel we suppress the notation $*$ (as is common in
the area) and write the product $\lambda * \mu$ of $\lambda, \mu \in P(G)$ as $\lambda\mu$, and similarly
for $n \geq 2$ the $n$-fold product of $\mu \in P(G)$ by $\mu^n$. Also, for any $g \in G$
and $\mu \in P(G)$ we shall write $g\mu$ for $\delta_g * \mu$ and similarly $\mu * \delta_g$ by $\mu g$.
In view of the observations above this change in notation is unambiguous.
For $\lambda \in P(G)$ we denote by supp $\lambda$ the support of $\lambda$, namely the smallest
closed subset with measure 1. For any closed subgroup $H$ of $G$ we shall
also denote by $P(H)$ the subsemigroup of $P(G)$ consisting of all $\lambda \in P(G)$
such that supp $\lambda$ is contained in $H$.

With regard to the definitions and the discussion in the sequel it would
be convenient to bear in mind the following connection between probability
measures and "random walks" on $G$. To each $\mu \in P(G)$ there corresponds a
random walk on $G$ with $\mu$ as its transition probability, namely the Markov
process such that for any $a \in G$ and any measurable subset $E$ of $G$ the
probability of starting at $a \in G$ and landing in $aE$ (in one step) equals
$\mu(E)$.


Definition 5.1. Let $\mu \in P(G)$.
i) A probability measure $\lambda \in P(G)$ is called an $n$th root of $\mu$ if $\lambda^n = \mu$.
ii) A probability measure $\sigma \in P(G)$ is called a factor of $\mu$ if there exists
$\rho \in P(G)$ such that $\rho\sigma = \sigma\rho = \mu$.


It may be noted that a factor in the above sense is a “two sided factor” in 
the semigroup structure of P(G). We will not have an occasion to consider 
one-sided factors in the usual sense, and the term factor will consistently 
mean a two-sided factor. 


Remark 5.1. i) Every root of $\mu$ is a factor of $\mu$. On the other hand
in general factors form a much larger class of measures, even for point
measures.

ii) Given $g \in G$, $\mu \in P(G)$ is a factor of $\delta_g$ only if $\mu = \delta_h$ for some $h \in G$
which commutes with $g$; if furthermore $\mu$ is a root of $\delta_g$ then the element
$h$ is a root of $g$ in $G$.


Remark 5.2. If $\lambda$ is an $n$th root of $\mu$, $n \geq 2$, then the random walk
corresponding to $\mu$ is the $n$-step iterate of the random walk corresponding to
the $n$th root $\lambda$. Similarly factorisation of $\mu$ corresponds to factorisation of
the corresponding random walks.
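
Remark 5.2 can be checked by hand on a finite group. The sketch below takes, purely as an assumed example, the cyclic group $\mathbb{Z}_5$ written additively, with measures stored as probability vectors; the helper names are hypothetical.

```python
import numpy as np

def convolve(lam, mu):
    """Convolution of two probability vectors on the cyclic group Z_m."""
    m = len(lam)
    out = np.zeros(m)
    for a in range(m):
        for b in range(m):
            out[(a + b) % m] += lam[a] * mu[b]
    return out

def walk_matrix(mu):
    """One-step transition matrix of the random walk: T[a, b] = mu({b - a})."""
    m = len(mu)
    return np.array([[mu[(b - a) % m] for b in range(m)] for a in range(m)])

m = 5
lam = np.zeros(m)
lam[0] = lam[1] = 0.5                     # lam = (delta_0 + delta_1) / 2
mu = convolve(lam, lam)                   # so lam is a 2nd root of mu

T_lam, T_mu = walk_matrix(lam), walk_matrix(mu)
print(mu)                                 # weights of mu on 0, 1, 2, 3, 4
print(np.allclose(T_mu, T_lam @ T_lam))   # mu-walk equals 2 steps of the lam-walk
```

Since $\mathbb{Z}_5$ is abelian, every convolution factorisation here is automatically two sided, so the example does not exhibit the non-commutative subtleties of Definition 5.1.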


The main aim of this article is to discuss results about the sets of roots 
and factors of probability measures. Much of the study of these was inspired 
by the so called embedding problem, which we will now recall. 


Definition 5.2. A probability measure $\mu$ is said to be infinitely divisible
if it has $n$th roots for all natural numbers $n$.


In the (algebraic) study of semigroups an element is said to be
"divisible" if it has roots of all orders, and the term "infinitely" as above is
redundant, but in probability theory it has been a tradition, since the early
years of classical probability, to use the phrase "infinitely divisible".


Definition 5.3. A family $\{\mu_t\}_{t>0}$ of probability measures on $G$ is
called a one-parameter convolution semigroup if $\mu_{s+t} = \mu_s\mu_t$ for all
$s,t > 0$, and it is said to be a continuous one-parameter convolution
semigroup if the map $t \mapsto \mu_t$, $t > 0$, is continuous.

A probability measure $\mu$ is said to be embeddable if there exists a
continuous one-parameter convolution semigroup $\{\mu_t\}$ such that
$\mu = \mu_1$.


Every embeddable measure is infinitely divisible, since given $\mu = \mu_1$ in
a one-parameter convolution semigroup $\{\mu_t\}_{t>0}$, for any $n \geq 2$,
$\mu_{1/n}$ is an $n$th root of $\mu$.
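
A familiar instance of this observation is the Poisson semigroup. The sketch below is an illustration under stated assumptions, not part of the original argument: it takes $G = \mathbb{R}$ with measures supported on the non-negative integers, lets $\mu_t$ be the Poisson law with mean $t$, and checks numerically (on the first 20 probabilities, which truncated convolution computes exactly up to rounding) that $\mu_s\mu_t = \mu_{s+t}$ and that $\mu_{1/n}$ is an $n$th root of $\mu_1$. The helper names are hypothetical.

```python
import math

def poisson(t, K):
    """First K probabilities of the Poisson law with mean t."""
    return [math.exp(-t) * t**k / math.factorial(k) for k in range(K)]

def convolve(a, b):
    """Convolution on the non-negative integers, truncated to len(a) terms."""
    K = len(a)
    return [sum(a[j] * b[k - j] for j in range(k + 1)) for k in range(K)]

def close(a, b, tol=1e-12):
    """Entrywise comparison up to a small floating-point tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

K = 20

# Semigroup property: mu_s * mu_t = mu_{s+t}.
semigroup_ok = close(convolve(poisson(0.3, K), poisson(0.7, K)), poisson(1.0, K))

# mu_{1/n} is an n-th root of mu_1.
n = 4
root = poisson(1.0 / n, K)
nth_power = root
for _ in range(n - 1):
    nth_power = convolve(nth_power, root)
root_ok = close(nth_power, poisson(1.0, K))

print(semigroup_ok, root_ok)
```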

There is a rich analytic theory for embeddable measures, yielding in
particular a Lévy-Khintchine representation theorem for these measures.
Such a theory was developed by G.A. Hunt in the case of connected Lie 
groups, and has been extended to locally compact groups by Heyer, Hazod 
and Siebert (see [9]). 

In the light of Hunt’s theory, K.R. Parthasarathy ([13]) raised the ques- 
tion whether one can embed a given infinitely divisible probability measure 
in a one-parameter convolution semigroup, in particular to obtain a Lévy- 
Khintchine representation for it; this would of course involve some extra 
condition on $G$, since infinite divisibility does not always imply
embeddability; e.g. if $G$ is the group of rational numbers with the discrete
topology then $\delta_1$ is infinitely divisible, but it cannot be embeddable.
On the other hand, for the classical groups $\mathbb{R}^d$, $d \geq 1$, every
infinitely divisible probability measure is embeddable. A locally compact
group $G$ is said to have the embedding property if every infinitely divisible
probability measure on $G$ is embeddable.
It was shown in [13] that compact groups have the embedding property; an 
analogous result was also proved for measures on symmetric spaces of non- 
compact type, but we shall not be concerned with it here. Parthasarathy’s 
work inspired a folklore conjecture that every connected Lie group has the 
embedding property. This conjecture is not yet fully settled, though it is 
now known to be true for a large class of Lie groups. We will discuss the 
details in this respect in the last section. 

We will conclude this section by recalling a result which illustrates how 
the study of the set of roots plays a role in the embedding problem. 


Definition 5.4. A probability measure $\mu$ is said to be strongly root compact
if the set $\{\lambda^k \mid \lambda^n = \mu \text{ for some } n \in \mathbb{N},\
1 \leq k \leq n\}$ has compact closure in $P(G)$.


Theorem 5.1. Let $G$ be a Lie group. If $\mu \in P(G)$ is infinitely divisible
and strongly root compact then it is embeddable.


Such a result is known, in place of Lie groups, also for a larger class of 
locally compact groups; see [12], Corollary 3.7. 

In view of Theorem 5.1, to prove that a Lie group $G$ has the embedding
property it suffices to show the following: given $\mu \in P(G)$ infinitely
divisible there exists a closed subgroup $H$ of $G$ and a root $\nu$ of $\mu$
such that $\operatorname{supp}\nu$ is contained in $H$ and, viewed as a
probability measure on $H$, it is infinitely divisible and strongly root
compact. Proving existence of such $H$ and $\nu$ has been one of the
strategies for proving the embedding theorem.


5.2. Some Basic Properties of Factors and Roots 


Let $G$ be a locally compact second countable group and $\mu \in P(G)$. We
begin by introducing some notation associated with $\mu$. We denote by $G(\mu)$
the smallest closed subgroup containing $\operatorname{supp}\mu$, or
equivalently the smallest closed subgroup whose complement has measure 0.
Let $N(\mu)$ denote the normaliser of $G(\mu)$ in $G$, namely

$$N(\mu) = \{g \in G \mid gxg^{-1} \in G(\mu) \text{ for all } x \in G(\mu)\}.$$

Then $N(\mu)$ is a closed subgroup of $G$.
The following is an interesting simple lemma.


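Lemma 5.1. Let $\mu \in P(G)$ and $\lambda$ be a factor of $\mu$. Then
$\operatorname{supp}\lambda$ is contained in $N(\mu)$.


Proof. Let $\nu \in P(G)$ be such that $\mu = \lambda\nu = \nu\lambda$. Then we
have $\operatorname{supp}\mu =
\overline{(\operatorname{supp}\lambda)(\operatorname{supp}\nu)} =
\overline{(\operatorname{supp}\nu)(\operatorname{supp}\lambda)}$. Let
$g \in \operatorname{supp}\lambda$ and consider any
$x \in (\operatorname{supp}\nu)(\operatorname{supp}\lambda)$, say $x = yz$ with
$y \in \operatorname{supp}\nu$ and $z \in \operatorname{supp}\lambda$. Then
$gxg^{-1} = gyzg^{-1}$. Picking any $w \in \operatorname{supp}\nu$ we can
therefore write $gxg^{-1}$ as $(gy)(zw)(gw)^{-1}$. As $gy$, $zw$ and $gw$ are
contained in $(\operatorname{supp}\lambda)(\operatorname{supp}\nu) \subset
\operatorname{supp}\mu$ this shows that $gxg^{-1} \in G(\mu)$. As this holds
for all $x \in (\operatorname{supp}\nu)(\operatorname{supp}\lambda)$ and the
latter set is dense in $\operatorname{supp}\mu$ it follows that $gxg^{-1}$ is
contained in $G(\mu)$ for all $x \in \operatorname{supp}\mu$, and in turn for
all $x \in G(\mu)$. Hence $\operatorname{supp}\lambda$ is contained in $N(\mu)$.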


Note that $G(\mu)$ is a closed normal subgroup of $N(\mu)$ and we may form
the quotient group $N(\mu)/G(\mu)$. Let $p : N(\mu) \to N(\mu)/G(\mu)$ be the
quotient homomorphism. Consider the image $p(\mu)$ of $\mu$ in $N(\mu)/G(\mu)$.
It is the point mass at the identity element in $N(\mu)/G(\mu)$. Let $\lambda$
be any factor of $\mu$. Then $p(\lambda)$ is a factor of $p(\mu)$, and since the
latter is a point mass so is $p(\lambda)$. Hence there exists $g \in N(\mu)$
such that $\operatorname{supp}\lambda$ is contained in $gG(\mu) = G(\mu)g$. If
furthermore $\lambda$ is a root of $\mu$, say $\lambda^n = \mu$, then
$p(\lambda)^n = p(\mu)$ and in this case $g$ as above is such that $p(g)^n$ is
the identity, so $g^n \in G(\mu)$. These observations may be summed up as
follows:


Lemma 5.2. If $\lambda$ is a factor of $\mu$ then there exist $g \in N(\mu)$ and
$\sigma \in P(G(\mu))$ such that $\lambda = \sigma g$. If furthermore
$\lambda^n = \mu$ for some $n \in \mathbb{N}$ then $g^n \in G(\mu)$.


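Let $\lambda$ be a root of $\mu$, say $\lambda^n = \mu$ with $n \in \mathbb{N}$.
Let $g \in N(\mu)$ and $\sigma \in P(G(\mu))$, as obtained above, such that
$g^n \in G(\mu)$ and $\lambda = \sigma g$. Then we have

$$\mu = \lambda^n = (\sigma g)^n = \sigma\,(g\sigma g^{-1})(g^2\sigma g^{-2})
\cdots (g^{n-1}\sigma g^{-(n-1)})\,g^n.$$

Let $\theta_g : G(\mu) \to G(\mu)$ be the automorphism defined by
$\theta_g(x) = gxg^{-1}$ for all $x \in G(\mu)$; note that
$\sigma \mapsto \theta_g(\sigma)$, $\sigma \in P(G(\mu))$, defines a
homomorphism of the semigroup $P(G(\mu))$. From the identity we see that


Lemma 5.3. $\lambda$ is an $n$th root of $\mu$ if and only if it is of the form
$\sigma g$ with $\sigma \in P(G(\mu))$ and $g \in N(\mu)$ such that
$\sigma\,\theta_g(\sigma) \cdots \theta_g^{n-1}(\sigma)\,g^n = \mu$.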


The point about this characterisation is that the relation in the conclusion
is entirely within $G(\mu)$ on which the measure $\mu$ lives. The measure
$\sigma$ may be viewed as an "affine $n$th root" of $\mu$ in $G(\mu)$, depending
on the automorphism $\theta_g$ and the translating element $g^n$ from $G(\mu)$.
It is more convenient when the translating element $g^n$ is the identity
element. The notion of affine $n$th root in this sense is studied in [7] (the
results there have some consequences to the embedding problem, which however
are beyond the scope of the present article). In general it may not be
possible to choose the element $g$ (in its $G(\mu)$ coset in $N(\mu)$) to be
such that $g^n$ is the identity element. However there are many natural
situations in which this is possible.

We denote by $F(\mu)$ the set of all factors of $\mu$ in $P(G)$. The next
result is about sequences in $F(\mu)$, and in particular shows that $F(\mu)$ is
a closed set. It may be noted here that for a given $n \in \mathbb{N}$ the set
of all $n$th roots can be readily seen to be a closed set. On the other hand
the set of all roots of $\mu$ is in general not closed, as can be seen, for
example, from the fact that the roots of unity form a dense subset of the
circle group.


Proposition 5.1. Let $\{\lambda_i\}$ be a sequence in $F(\mu)$, and let
$\{\nu_i\}$ in $P(G)$ be such that $\lambda_i\nu_i = \nu_i\lambda_i = \mu$ for
all $i$. Then there exists a sequence $\{x_i\}$ in $N(\mu)$ such that the
following holds: the sequences $\{x_i\lambda_i\}$, $\{\lambda_i x_i\}$,
$\{x_i^{-1}\nu_i\}$ and $\{\nu_i x_i^{-1}\}$ have compact closures in $P(G)$;
in turn the sequences $\{x_i\mu x_i^{-1}\}$ and $\{x_i^{-1}\mu x_i\}$ are
contained in a compact subset of $P(G)$.


The first part of the assertion may be seen from the proof of Proposition 1.2
in [2] (the statement of the Proposition there is not in this form, but the
proof includes this assertion along the way); see also [1], Proposition 4.2
for another presentation of the proof. The assertion about
$\{x_i\mu x_i^{-1}\}$ and $\{x_i^{-1}\mu x_i\}$ follows immediately from the
first statement, and the relation between the measures: indeed,
$x_i\mu x_i^{-1} = (x_i\lambda_i)(\nu_i x_i^{-1})$ and
$x_i^{-1}\mu x_i = (x_i^{-1}\nu_i)(\lambda_i x_i)$, which yields the desired
conclusion.


Corollary 5.1. $F(\mu)$ is a closed subset of $P(G)$.


Proof. Let $\{\lambda_i\}$ be a sequence in $F(\mu)$ converging to
$\lambda \in P(G)$, and $\{\nu_i\}$ be such that
$\lambda_i\nu_i = \nu_i\lambda_i = \mu$ for all $i$. Let $\{x_i\}$ be a sequence
in $N(\mu)$ as in Proposition 5.1. Since $\{\lambda_i\}$ and
$\{\lambda_i x_i\}$ have compact closures, it follows that $\{x_i\}$ has
compact closure in $N(\mu)$. In turn, together with the fact that
$\{\nu_i x_i^{-1}\}$ has compact closure this implies that $\{\nu_i\}$ has
compact closure. Passing to a subsequence we may assume that it converges, to
say $\nu \in P(G)$. Then $\lambda\nu = \nu\lambda = \mu$, and hence
$\lambda \in F(\mu)$, which shows that $F(\mu)$ is closed. $\square$


5.3. Factor Sets 


Let $G$ be a locally compact group and $\mu \in P(G)$. In this section we will
discuss the factor set of $\mu$, under certain conditions on $G$. As before we
denote the set of factors of $\mu$ by $F(\mu)$. Let

$$Z(\mu) = \{g \in G \mid gx = xg \text{ for all } x \in
\operatorname{supp}\mu\},$$

the centraliser of $\operatorname{supp}\mu$ (or equivalently of $G(\mu)$) in
$G$. Also let

$$T(\mu) = \{g \in G \mid g\mu = \mu g\}.$$

Then $Z(\mu)$ and $T(\mu)$ are closed subgroups and $Z(\mu)$ is contained in
$T(\mu)$. We note that if $\lambda$ is a factor of $\mu$ then for any
$x \in T(\mu)$, $x\lambda$ is also a factor of $\mu$; if $\nu$ is such that
$\mu = \lambda\nu = \nu\lambda$ then
$(x\lambda)(\nu x^{-1}) = x\mu x^{-1} = \mu = (\nu x^{-1})(x\lambda)$.

Thus $T(\mu)$ (and also $Z(\mu)$) act on the space $F(\mu)$ (viewed with the
subspace topology from $P(G)$). The key question is how large are the
quotient spaces $F(\mu)/T(\mu)$, $F(\mu)/Z(\mu)$, and specifically whether
there exist compact sets of representatives for the actions.

In the light of Proposition 5.1 the question is related to understanding
sequences $\{x_i\}$ in $G$ such that $\{x_i\mu x_i^{-1}\}$ and
$\{x_i^{-1}\mu x_i\}$ are contained in compact sets. Consider the action of $G$
on $P(G)$, with the action of $g \in G$, which we shall denote by $\Phi_g$,
given by $\lambda \mapsto g\lambda g^{-1}$. Then $\{\Phi_{x_i}(\mu)\}$ and
$\{\Phi_{x_i^{-1}}(\mu)\}$ are contained in compact sets, and we want to know
whether this implies that $\{x_iZ(\mu)\}$ is relatively compact in $G/Z(\mu)$,
or $\{x_iT(\mu)\}$ is relatively compact in $G/T(\mu)$. A general scheme for
studying asymptotic behaviour of measures under the action of a sequence of
automorphisms of the group is discussed in [1], where a variety of
applications are indicated, including to the study of the factor sets of
measures. Questions involving orbit behaviour typically have better
accessibility in the framework of algebraic groups, and in the present
instance also the known results are based on techniques from the area. We
shall now briefly recall the set up, in a relatively simpler form, and then
the results.

Let $G$ be a subgroup of $GL(d,\mathbb{R})$, $d \geq 2$. Then $G$ is said to be
algebraic if there exists a polynomial $P(x_{ij})$ in $d^2$ variables $x_{ij}$,
$i,j = 1,\dots,d$, such that $G = \{(g_{ij}) \mid P(g_{ij}) = 0\}$ (normally,
over a general field, one takes a set of polynomials, but over the field of
real numbers one polynomial suffices). Also, it is said to be almost algebraic
if it is an open subgroup of an algebraic subgroup. Almost algebraic subgroups
form a rich class of Lie groups. To that end we may mention that given a
connected Lie subgroup $G$ of $SL(d,\mathbb{R})$ the smallest almost algebraic
subgroup $\tilde{G}$ containing $G$ is such that $G$ is a normal subgroup of
$\tilde{G}$, $\tilde{G}/G$ is isomorphic to $\mathbb{R}^k$ for some $k$, and
$[\tilde{G},\tilde{G}] = [G,G]$; in particular if $G_1$ and $G_2$ are two
connected Lie subgroups of $GL(d,\mathbb{R})$ whose commutator subgroups are
different then the corresponding almost algebraic subgroups are distinct. In
the sequel we will suppress the inclusion of the groups $G$ in
$GL(d,\mathbb{R})$ as above, and think of them independently as "almost
algebraic groups", the $GL(d,\mathbb{R})$ being in the background.

A connected Lie group is said to be W-algebraic if i) $\operatorname{Ad}G$ is
an almost algebraic subgroup of $GL(\mathfrak{G})$, where $\mathfrak{G}$ is the
Lie algebra of $G$, and ii) for any compact subgroup $C$ contained in the
centre of $G$ and $x \in G$,
$\{g \in C \mid g = xyx^{-1}y^{-1} \text{ for some } y \in G\}$ is finite. The
class of W-algebraic
groups includes all almost algebraic connected Lie groups, all connected 
semisimple Lie groups and also, more generally, all connected Lie groups 
whose nilradical is simply connected. 

The following result was proved in [6].


Theorem 5.2. Let $G$ be a W-algebraic group and $\mu \in P(G)$. Then
$F(\mu)/Z(\mu)$ is compact. In particular, $F(\mu)/T(\mu)$ and $T(\mu)/Z(\mu)$
are compact.


For the case of almost algebraic groups (which in particular are W- 
algebraic) a proof of the theorem may be found in [4]. The question was also 
studied earlier in [3] and the same conclusion was upheld under a condition 
termed as “weakly algebraic”, but the method there is more involved. 

There are examples to show that if either of the conditions in Theorem 5.2
does not hold then the conclusion that $F(\mu)/Z(\mu)$ is compact need not
hold; see [3]. Let us only recall here the example pertaining to the second
condition (in a slightly modified form compared with [3]).


Example 5.1. Let $H$ be the Heisenberg group consisting of $3 \times 3$ upper
triangular unipotent matrices. Let $Z$ be the (one-dimensional) centre of $H$
and $D$ be a nonzero cyclic subgroup of $Z$. Let $G = H/D$. Then $T = Z/D$
is a compact subgroup forming the center of $G$, and $G/T$ is topologically
isomorphic to $\mathbb{R}^2$. On $G$ we can have a probability measure $\mu$
which is invariant under the action of $T$ by translations, and such that
$G(\mu) = G$. Then for any $g \in G$ the $T$-invariant probability measure
supported on $gT$ is a factor of $\mu$. On the other hand, since $G(\mu) = G$,
$Z(\mu) = T$. It follows that $F(\mu)/Z(\mu)$ cannot be compact.


In all the known examples where $F(\mu)/Z(\mu)$ is not compact for a
probability measure $\mu$ on a connected Lie group $G$, the construction
involves in fact that $T(\mu)/Z(\mu)$ is non-compact. It is not known whether
there exist a connected Lie group $G$ and a $\mu \in P(G)$ such that
$F(\mu)/T(\mu)$ is non-compact.


Remark 5.3. Conditions under which $\{x_i\mu x_i^{-1}\}$ can be relatively
compact for a probability measure $\mu$ and a sequence $\{x_i\}$ in the group
are not well understood. When this holds for a $\mu$ for a sequence $\{x_i\}$
not contained in a compact subset, $\mu$ is said to be collapsible. Some
partial results were obtained on this question in [8], and were applied in the
study of decay of concentration functions of convolution powers of
probability measures.


5.4. Compactness 


Let $G$ be a locally compact second countable group, and let $\mu \in P(G)$.
The set $R(\mu) = \{\lambda^k \mid \lambda^n = \mu \text{ for some }
n \in \mathbb{N},\ 1 \leq k \leq n\}$ is called the root set of $\mu$. Recall
that $\mu$ is said to be strongly root compact if the root set is relatively
compact and, by Theorem 5.1, when this holds infinite divisibility of $\mu$
implies embeddability. Clearly $R(\mu)$ is contained in the factor set
$F(\mu)$, so $\mu$ is strongly root compact when $F(\mu)$ is compact. In
the light of the results of the previous section we have the following.


Corollary 5.2. Let $G$ be a W-algebraic group and let $\mu$ be such that
$Z(\mu)$ is compact. Then $\mu$ is strongly root compact. In particular, if $G$
is an almost algebraic group with compact center then any $\mu \in P(G)$ such
that $G(\mu) = G$ (namely such that $\operatorname{supp}\mu$ is not contained
in any proper closed subgroup) is strongly root compact.


The following proposition enables extending the class of measures for 
which strong root compactness holds (see [11], Proposition 8). 


Proposition 5.2. Let $G$ and $H$ be locally compact second countable groups
and suppose there is a continuous surjective homomorphism $\psi : H \to G$
such that the kernel of $\psi$ is a compactly generated subgroup contained in
the center of $H$. Let $\nu \in P(H)$ and $X$ be a subset of $R(\nu)$ such that
$\psi(X)$ is relatively compact in $P(G)$. Then $X$ is a relatively compact
subset of $P(H)$.


Corollary 5.3. Let $G$ be a connected Lie group and suppose that there
exists a closed subgroup $Z$ contained in the center such that $G/Z$ is
topologically isomorphic to a W-algebraic group. Let $\eta : G \to G/Z$ be the
quotient homomorphism. If $\mu \in P(G)$ is such that $\eta(\mu)$ is strongly
root compact then $\mu$ is strongly root compact.


We note in this respect that every closed subgroup contained in the
center of a connected Lie group is compactly generated (see [10]), so
Proposition 5.2 applies.

By an inductive argument using the above corollary one can prove the 
following. 


Corollary 5.4. i) If $G$ is a connected nilpotent Lie group then every
$\mu \in P(G)$ is strongly root compact.

ii) If $G$ is an almost algebraic group and $\mu \in P(G)$ is such that
$\operatorname{supp}\mu$ is not contained in any proper almost algebraic
subgroup of $G$ then $\mu$ is strongly root compact.


It may be mentioned that there is also a group theoretic condition called
Böge strong root compactness which implies strong root compactness of
every measure on the group. Various groups including compact groups,
connected nilpotent Lie groups and also connected solvable groups $G$ for
which all eigenvalues of $\operatorname{Ad} g$, $g \in G$, are real (in this
case $G$ is said to have real roots) are known to satisfy the Böge strong root
compactness condition. The reader is referred to [9], Chapter III for details;
see also [12].


5.5. Roots 


Let $G$ be a locally compact group and let $\mu \in P(G)$. We now discuss the
set of roots of $\mu$. For $n \in \mathbb{N}$ we denote by $R_n(\mu)$ the set of
$n$th roots of $\mu$, namely $\lambda \in P(G)$ such that $\lambda^n = \mu$.

Let $Z(\mu)$ and $T(\mu)$ be the subgroups as before. We note that if
$\lambda \in R_n(\mu)$ and $g \in T(\mu)$ then $g\lambda g^{-1} \in R_n(\mu)$,
since $(g\lambda g^{-1})^n = g\lambda^n g^{-1} = g\mu g^{-1} = \mu$ (note that a
translate of a root by an element of $T(\mu)$ or $Z(\mu)$ need not be a root).
Thus we have an action of $T(\mu)$ on each $R_n(\mu)$, with the action of
$g \in T(\mu)$ given by $\lambda \mapsto g\lambda g^{-1}$ for all
$\lambda \in R_n(\mu)$. The object in this case will be to understand the
quotient space of $R_n(\mu)$ under this action.

Let us first discuss a special case. Let $G = SL(d,\mathbb{C})$ and
$\mu = \delta_I$, where $I$ is the identity matrix. Then for $n \in \mathbb{N}$,
$R_n(\mu)$ consists of $\{\delta_x \mid x \in G,\ x^n = I\}$. Every such $x$ is
diagonalisable, viz. has the form $gdg^{-1}$, for some $g \in G$ and
$d = \operatorname{diag}(\sigma_1,\dots,\sigma_d)$, with each $\sigma_i$ an
$n$th root of unity; there are only finitely many of these diagonal matrices.
Since $\mu = \delta_I$, we have $Z(\mu) = G$ (and hence also $T(\mu) = G$), so
the diagonal matrices as above form a set of representatives for the quotient
space of $R_n(\mu)$ under the action of $T(\mu)$ defined above. In particular
the quotient space is finite. An analogous assertion holds for any point mass
over an algebraic group.
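
The finiteness in this special case can be made concrete by counting. The sketch below is an illustration only, with hypothetical function names: a diagonal matrix whose entries are $n$th roots of unity is encoded by its tuple of exponents in $\mathbb{Z}/n$, and the determinant condition becomes "the exponents sum to 0 mod $n$". It relies on the standard fact (not stated in the text) that two diagonalisable matrices are conjugate exactly when they have the same multiset of eigenvalues, so conjugacy classes correspond to multisets of exponents.

```python
from itertools import combinations_with_replacement, product

def diagonal_roots(n, d):
    """Exponent tuples (e_1, ..., e_d) in (Z/n)^d with sum 0 mod n.

    Each tuple stands for diag(w**e_1, ..., w**e_d) with w = exp(2*pi*i/n),
    a diagonal matrix of determinant 1 whose n-th power is the identity.
    """
    return [e for e in product(range(n), repeat=d) if sum(e) % n == 0]

def conjugacy_classes(n, d):
    """Multisets of exponents with sum 0 mod n, one per conjugacy class."""
    return [e for e in combinations_with_replacement(range(n), d)
            if sum(e) % n == 0]

n, d = 3, 2
num_diagonal = len(diagonal_roots(n, d))    # diagonal n-th roots of I
num_classes = len(conjugacy_classes(n, d))  # representatives up to conjugation
print(num_diagonal, num_classes)
```

The first count is $n^{d-1}$, since the last exponent is determined by the others; the second is smaller because permuting the diagonal entries gives conjugate matrices.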

The following theorem provides a generalisation of this picture in the 
special case, to more general probability measures, over a class of Lie groups 
G. In the general case the quotient is shown to be a compact set (in place 
of being finite in the special case). The condition that we need on G is 
described in the following Proposition (see [5], Proposition 2.5). 


Proposition 5.3. Let $G$ be a connected Lie group. Then the following
conditions are equivalent.

i) there exists a representation $\rho : G \to GL(d,\mathbb{R})$ for some
$d \in \mathbb{N}$ such that the kernel of $\rho$ is discrete.

ii) if $R$ is the radical of $G$ then $[R,R]$ is a closed simply connected
nilpotent Lie subgroup.


A connected Lie group satisfying either of the equivalent conditions is
said to be of class C. Groups of class C include all linear groups, all simply
connected Lie groups (by Ado’s theorem), and all semisimple Lie groups 
(through the adjoint representation), and thus form a large class. 


Remark 5.4. We note however that not all connected Lie groups are of
class C. An example of this may be given as follows. Let $H$ and $D$ be
as in Example 5.1 and let $N = H/D$. Then $N$ is not of class C; in fact
it can be shown that for any finite-dimensional representation of $N$, the
one-dimensional center $T = Z/D$ of $N$ is contained in the kernel of the
representation. An example of a non-nilpotent Lie group which is not of
class C may be constructed from this as follows. Note that $N/T$ is
isomorphic to $\mathbb{R}^2$. The group of rotations of $\mathbb{R}^2$ extends,
over the quotient homomorphism of $N$ onto $N/T$, to a group of automorphisms
of $N$, say $C$. Let $G = C \cdot N$, the semidirect product. Then $G$ is not
of class C. Similarly $SL(2,\mathbb{R})$, viewed as a group of automorphisms of
$\mathbb{R}^2$, extends to a group of automorphisms of $N$ and the
corresponding semidirect product is a non-solvable Lie group that is not of
class C.


For any $\lambda \in P(G)$ we denote by $Z^0(\lambda)$ the connected component
of the identity in $Z(\lambda)$.


Theorem 5.3. Let $G$ be a connected Lie group of class C and let
$\mu \in P(G)$. Let $n \in \mathbb{N}$ and $\{\lambda_i\}$ be a sequence in
$R_n(\mu)$. Then there exists a sequence $\{z_i\}$ in $Z^0(\mu)$ such that
$\{z_i\lambda_i z_i^{-1}\}$ is relatively compact. Moreover, the sequence
$\{z_i\}$ has also the property that if $m \in \mathbb{N}$ and $\{\nu_i\}$ is a
sequence in $R_{mn}(\mu)$ such that $\nu_i^m = \lambda_i$ and
$Z^0(\nu_i) = Z^0(\lambda_i)$ for all $i$, then $\{z_i\nu_i z_i^{-1}\}$ is
relatively compact.


Let $n \in \mathbb{N}$ and let $\sim$ denote the equivalence relation on
$R_n(\mu)$ defined by $\lambda \sim \lambda'$, for
$\lambda, \lambda' \in R_n(\mu)$, if there exists a $g \in Z^0(\mu)$ such that
$\lambda' = g\lambda g^{-1}$. Then the first assertion in the theorem shows in
particular that the quotient space $R_n(\mu)/\!\sim$ is compact. The point in
the second statement is that the same $z_i$'s work for $n$th as well as $mn$th
roots, and this can be seen to be useful in taking "inverse limits". Such
limits appear in the proof of the embedding theorem in [5].

We will now sketch a part of the proof of the theorem, recalling some key
ingredients, which could be of independent interest. For simplicity we shall
assume that $G$ is an almost algebraic group. Let $n \in \mathbb{N}$ and a
sequence $\{\lambda_i\}$ in $R_n(\mu)$ be given. By Theorem 5.2 there exists a
sequence $\{x_i\}$ in $Z(\mu)$ such that $\{\lambda_i x_i\}$ is relatively
compact. Thus we have a sequence of translates by elements of $Z(\mu)$ forming
a relatively compact set. Our aim would be to find a sequence of conjugates
contained in a compact set. To achieve this we proceed as follows. As
$\{\lambda_i x_i\}$ is relatively compact, so is $\{(\lambda_i x_i)^n\}$.
Recall that each $\lambda_i$ can be written as $\sigma_i y_i$, with
$\sigma_i \in P(G(\mu))$ and $y_i \in N(\mu)$. Then for any $i$,
$(\lambda_i x_i)^n = (\sigma_i y_i x_i)^n =
(\sigma_i y_i x_i)(\sigma_i y_i x_i) \cdots (\sigma_i y_i x_i)$, and
a straightforward computation using that $x_i \in Z(\mu)$ shows that
$(\sigma_i y_i x_i)^n$ equals
$(\sigma_i y_i)^n(\alpha_i(x_i)\alpha_i^2(x_i) \cdots \alpha_i^n(x_i))$, where
$\alpha_i$ is the automorphism of $Z(\mu)$ defined by
$\alpha_i(z) = y_i z y_i^{-1}$ for all $z \in Z(\mu)$. Therefore
$(\lambda_i x_i)^n =
\lambda_i^n(\alpha_i(x_i)\alpha_i^2(x_i) \cdots \alpha_i^n(x_i)) =
\mu(\alpha_i(x_i)\alpha_i^2(x_i) \cdots \alpha_i^n(x_i))$. As
$\{\lambda_i x_i\}$ has compact closure in $P(G)$ it follows that
$\{\alpha_i(x_i)\alpha_i^2(x_i) \cdots \alpha_i^n(x_i)\}$ is relatively compact
in $Z(\mu)$.
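
The "straightforward computation" mentioned above can be spelled out as follows. This is a sketch in the notation of the text with the index $i$ dropped; it uses only that elements of $Z(\mu)$ commute with $\sigma$ (which lives on $G(\mu)$) and with $\mu$, and that conjugation by $y \in N(\mu)$ maps $Z(\mu)$ to itself.

```latex
% Notation (index i dropped): \sigma \in P(G(\mu)), y \in N(\mu), x \in Z(\mu),
% \alpha(z) = y z y^{-1} for z \in Z(\mu), and (\sigma y)^n = \lambda^n = \mu.
\begin{align*}
\sigma y x &= \sigma\,\alpha(x)\,y = \alpha(x)\,\sigma y,\\
\sigma y\,\alpha^{k}(x) &= \alpha^{k+1}(x)\,\sigma y \qquad (k \geq 0),\\
(\sigma y x)^2 &= \alpha(x)\,\sigma y\,\alpha(x)\,\sigma y
   = \alpha(x)\,\alpha^2(x)\,(\sigma y)^2,\\
(\sigma y x)^n &= \alpha(x)\,\alpha^2(x)\cdots\alpha^n(x)\,(\sigma y)^n
   = \mu\,\bigl(\alpha(x)\,\alpha^2(x)\cdots\alpha^n(x)\bigr).
\end{align*}
% The last equality uses (\sigma y)^n = \mu and that \mu commutes with every
% element of Z(\mu).
```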

We now recall an interesting property of nilpotent Lie groups, called
"affine root rigidity".


Theorem 5.4. Let $N$ be a connected nilpotent Lie group. Let
$n \in \mathbb{N}$ and $\{\alpha_i\}$ be a sequence of automorphisms of $N$ such
that $\alpha_i^n = I$, the identity automorphism of $N$, for all
$i \in \mathbb{N}$. Let $\{x_i\}$ be a sequence in $N$ such that
$\{\alpha_i(x_i)\alpha_i^2(x_i) \cdots \alpha_i^n(x_i)\}$ is relatively
compact. Then there exists a sequence $\{\xi_i\}$ in $N$ such that
$\{\xi_i^{-1}x_i\}$ is relatively compact and
$\alpha_i(\xi_i)\alpha_i^2(\xi_i) \cdots \alpha_i^n(\xi_i) = e$, the identity
element of $N$.


While a priori the subgroup $Z(\mu)$ need not be nilpotent it turns out that
using some structure theory of almost algebraic groups one can reduce to
the case when the sequence $\{x_i\}$ as above is contained in the nilradical of
$Z^0(\mu)$, so this theorem can be applied to the above context. Following the
computation backwards one can now see that $\{\lambda_i\xi_i\}$ is relatively
compact and moreover $(\lambda_i\xi_i)^n = \mu$, namely the translates
$\lambda_i\xi_i$ are roots of $\mu$.

This brings us to another interesting fact, again involving nilpotent Lie 
groups: 


Theorem 5.5. Let $G$ be a locally compact second countable group and let
$\mu \in P(G)$. Let $N$ be a subgroup of $Z(\mu)$ which is isomorphic to a
simply connected nilpotent Lie group, and normalized by $N(\mu)$. Let
$n \in \mathbb{N}$, $\lambda \in P(G)$, and $\xi \in N$ be such that
$(\lambda\xi)^n = \lambda^n = \mu$. Then there exists $\zeta \in N$ such that
$\lambda\xi = \zeta\lambda\zeta^{-1}$.


This shows that the translates we had are also conjugates by another
sequence from the same subgroup. This proves the first assertion in
Theorem 5.3. The second part involves keeping track of how the conjugations
operate when we go to higher roots, using again certain properties of
nilpotent Lie groups.


5.6. One-Parameter Semigroups


In this section we discuss one-parameter semigroups and embeddings. One 
may think of “one-parameter semigroups” parametrised either by positive 
reals, or just by positive rationals (parametrisation by other subsemigroups 
of positive reals may also be considered, but we shall not go into that here). 
Via the study of the factor sets of measures the following was proved in [3]. 


Theorem 5.6. Let $G$ be a connected Lie group. Then any homomorphism
$\varphi$ of the semigroup $\mathbb{Q}^+$ (positive rationals, under addition)
into $P(G)$ extends to a homomorphism of $\mathbb{R}^+$ (positive reals) to
$P(G)$.


This reduces the embedding problem for infinitely divisible measures to 
finding a rational embedding. (Actually our proof of the embedding result 
in [5] does not make serious use of this, but it would help to see the problem 
in perspective). It may also be noted that the task of finding a rational
embedding of $\mu \in P(G)$ is equivalent to finding a sequence $\{\lambda_k\}$
in $P(G)$ such that $\lambda_1 = \mu$ and $\lambda_k^k = \lambda_{k-1}$ for all
$k \geq 2$; this produces a homomorphism from $\mathbb{Q}^+$ to $P(G)$ given by
$p/q \mapsto \lambda_q^{p(q-1)!}$. While by infinite divisibility $\mu$ admits
$k!$th roots for all $k$, we need to have them matching as above; an
arbitrarily picked $k!$th root may a priori not have any nontrivial roots at
all.
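
To see why a matching sequence of roots gives a well-defined homomorphism of the positive rationals, the following short verification may help. It is a sketch resting only on the relations $\lambda_1 = \mu$ and $\lambda_k^k = \lambda_{k-1}$; the name $\varphi$ for the resulting map is introduced here for illustration.

```latex
% Assumption: \lambda_1 = \mu and \lambda_k^k = \lambda_{k-1} for all k \geq 2.
\begin{align*}
\lambda_k^{k!} &= (\lambda_k^{k})^{(k-1)!} = \lambda_{k-1}^{(k-1)!}
   = \cdots = \lambda_1 = \mu,\\
\lambda_q &= \lambda_{q'}^{\,q'!/q!} \qquad (q \leq q'),\\
\varphi(p/q) &:= \lambda_q^{\,p\,(q-1)!}
   = \lambda_{q'}^{\,p\,(q-1)!\,q'!/q!}
   = \lambda_{q'}^{\,(p/q)\,q'!} \qquad (q \leq q').
\end{align*}
% The last expression depends only on the rational number p/q, so \varphi is
% well defined, and \varphi(1) = \lambda_1 = \mu. For rationals r, s and q'
% large enough, \varphi(r)\varphi(s) = \lambda_{q'}^{(r+s)\,q'!} = \varphi(r+s).
```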

Let us now come to the embedding problem, i.e. embedding a given
infinitely divisible probability measure $\mu$, on a Lie group, in a continuous
one-parameter semigroup. Recall that by Theorem 5.1 if $\mu$ is strongly root
compact then $\mu$ is embeddable, and in particular the conclusion holds for
the strongly root compact measures as noted in §5.4. In particular it was
known by the 1970's that all nilpotent Lie groups and solvable Lie groups
with real roots have the embedding property. The reader is referred to [9]
for details. In [4] it was shown that all connected Lie groups of class C (see
§5.5) have the embedding property. A simpler and more transparent proof
of the result was given in [5] using Theorem 5.3.


Theorem 5.7. Let $G$ be a Lie group of class C. Let $\mu \in P(G)$ be
infinitely divisible and let $r : \mathbb{N} \to P(G)$ be a map such that
$r(m)^m = \mu$ for all $m \in \mathbb{N}$. Then there exist sequences
$\{m_i\}$ in $\mathbb{N}$ and $\{z_i\}$ in $Z^0(\mu)$, and $n \in \mathbb{N}$
such that $n$ divides $m_i$ for all $i$ and the sequence
$\{z_i\,r(m_i)^{m_i/n}\,z_i^{-1}\}$ (consisting of $n$th roots of $\mu$)
converges to an $n$th root $\nu$ of $\mu$ which is infinitely divisible and
strongly root compact on the subgroup $Z(Z^0(\nu))$, the centraliser of
$Z^0(\nu)$ in $G$. Hence $\mu$ is embeddable.


An overall idea of the proof is as follows. A subset $M$ of $\mathbb{N}$ is
said to be infinitely divisible if for every $k \in \mathbb{N}$ there exists
$m \in M$ such that $k$ divides $m$. Let $M = \{m_i\}$ be an infinitely
divisible set and $n \in \mathbb{N}$. By infinite divisibility of $\mu$ we can
find a sequence $\{\rho_i\}$ in $P(G)$ such that for all $i$, $\rho_i$ is an
$m_i n$th root of $\mu$. Then $\rho_i^{m_i}$ is an $n$th root of $\mu$ for all
$i$. By Theorem 5.3 there exists a sequence $\{z_i\}$ in $Z^0(\mu)$ such that
$\{z_i\rho_i^{m_i}z_i^{-1}\}$ is relatively compact. Note that any limit point
of the sequence is an $n$th root of $\mu$ which is infinitely divisible in $G$.
We need the limit to be such that it is simultaneously infinitely divisible
and strongly root compact in a suitable subgroup of $G$. For this we need to
pick the set $M$ as above and $n$ suitably, which involves in particular
analysing how $Z^0(\lambda)$ changes over the roots $\lambda$ of higher and
higher order. For full details the reader is referred to the proof in the
original ([5]).

We conclude with some miscellaneous comments concerning the status 
of the embedding problem. 
i) For a general connected Lie group G, not necessarily of class C, we get a 
“weak embedding theorem” : 


Theorem 5.8. Let $G$ be a connected Lie group and $\mu \in P(G)$ be infinitely
divisible. Let $T$ be the maximal compact subgroup of $[R,R]$, where $R$ is the
solvable radical of $G$. Let $p : G \to G/T$ be the quotient homomorphism, and
let $M = p^{-1}(Z^0(p(\mu)))$. Then there exists a sequence $\{\zeta_i\}$ in
$M$ such that $\{\zeta_i\mu\zeta_i^{-1}\}$ converges to an embeddable measure.


This can be deduced from Theorem 5.7, using the fact that G/T as 
above is of class C (see [5]). 
ii) The embedding problem has been studied also for various other classes
of groups: discrete subgroups of $GL(d,\mathbb{R})$ (Dani and McCrudden),
finitely generated subgroups of $GL(n,A)$, where $A$ is the field of algebraic
numbers (Dani and Riddhi Shah), and $p$-adic groups (Riddhi Shah,
McCrudden-Walker); see [12] for some details and references.


Higher criticism has been proposed as a tool for highly multiple hypothesis
testing or signal detection, initially in cases where the distribution
of a test statistic (or the noise in a signal) is known and the component 
tests are statistically independent. In this paper we explore the ex- 
tent to which the assumptions of known distribution and independence 
can be relaxed, and we consider too the application of higher criticism 
to classification. It is shown that effective distribution approximations 
can be achieved by using a threshold approach; that is, by disregarding 
data components unless their significance level exceeds a sufficiently high 
value. This method exploits the good relative accuracy of approxima- 
tions to light-tailed distributions. In particular, it can be effective when 
the true distribution is founded on something like a Studentised mean, 
or on an average of related type, which is commonly the case in practice. 
The issue of dependence among vector components is also shown not to 
be a serious difficulty in many circumstances. 


6.1. Introduction 


Donoho and Jin (cf. [8]) developed higher-criticism methods for hypothesis 
testing and signal detection. Their methods are founded on the assumption 
that the test statistics are independent and, under the null hypothesis, 
have a known normal distribution. However, in some applications of higher
criticism, for example to more elaborate hypothesis testing problems and
to classification, these assumptions may not be tenable. For example, we 
may have to estimate the distributions from data, by pooling information 
from components that have “neighbouring” indices, and the assumption of 
independence may be violated. 

Taken together, these difficulties place obstacles in the way of using 
higher-criticism methods for a variety of applications, even though those 
techniques have potential performance advantages. We describe the effects 
that distribution approximation and data dependence can have on results, 
and suggest ways of alleviating problems caused by distribution approx- 
imation. We show too that thresholding, where only deviations above a 
particular value are considered, can produce distinct performance gains. 
Thresholding permits the experimenter to exploit the greater relative accu- 
racy of distribution approximations in the tails of a distribution, compared 
with accuracy towards the distribution’s centre, and thereby to reduce the 
tendency of approximation errors to accumulate. Our theoretical argu- 
ments take sample size to be fixed and the number of dimensions, p, to be 
arbitrarily large. 

Thresholding makes it possible to use rather crude distribution approx- 
imations. In particular, it permits the approximations to be based on rela- 
tively small sample sizes, either through pooling data from a small number 
of nearby indices, or by using normal approximations based on averages of 
relatively small datasets. Without thresholding, the distribution approxi- 
mations used to construct higher-criticism signal detectors and classifiers 
would have to be virtually root-p consistent. 

We shall provide theoretical underpinning for these ideas, and explore 
them numerically; and we shall demonstrate that higher criticism can ac- 
commodate significant amounts of local dependence, without being seri- 
ously impaired. We shall further show that, under quite general condi- 
tions, the higher-criticism statistic can be decomposed into two parts, of 
which one is stochastic and of smaller order than $p^\epsilon$ for any positive $\epsilon$, 
and the other is purely deterministic and admits a simple, explicit formula. 
This simplicity enables the effectiveness of higher criticism to be explored 
quite generally, for distributions where the distribution tails are heavy, and 
also for distributions that have relatively light tails, perhaps through be- 
ing convolutions of heavy-tailed distributions. These comments apply to 
applications to both signal detection and classification. 

In the contexts of independence and signal detection, [8] used an 
approach alternative to that discussed above. They employed delicate, 
empirical-process methods to develop a careful approximation, on the 
$\sqrt{\log\log p}$ scale, to the null distribution of the higher-criticism statistic. 
It is unclear from their work whether the delicacy of the log-log approximation 
is essential, or whether significant latitude is available for computing 
critical points. We shall show that quite crude bounds can in fact be used, 
in both the dependent and independent cases. Indeed, any critical point on 
a scale that is of smaller order than $p^\epsilon$, for each $\epsilon > 0$, is appropriate. 

Higher-criticism methods for signal detection have their roots in un- 
published work of Tukey; see [8] for discussion. Optimal, but more tightly 
specialised, methods for signal detection were developed by [17-19] and [20], 
broadly in the context of techniques for multiple comparison (see e.g. the 
methods of Bonferroni, [30] and [26]), for simultaneous hypothesis testing 
(e.g. [9] and [23]) and for moderating false-discovery rates (e.g. [2], [28] 
and [1]). Model-based approaches to the analysis of high-dimensional mi- 
croarray data include those of [29], [16], [12] and [11]. Related work on 
higher criticism includes that of [25] and [4]. Higher-criticism classification 
has been discussed by [15], although this work assumed that test statistic 
distributions are known exactly. Applications of higher criticism to sig- 
nal detection in astronomy include those of [22], [5, 6], [7] and [21]. [14] 
discussed properties of higher criticism under long-range dependence. 

Our main theoretical results are as follows. Theorem 6.1, in section 6.3.1, 
gives conditions under which the higher-criticism statistic, based 
on a general approximation to the unknown test distributions, can be ex- 
pressed in terms of its “ideal” form where the distributions are known, plus 
a negligible remainder. This result requires no assumptions about indepen- 
dence. Theorem 6.2, in section 6.3.2, gives conditions on the strength of 
dependence under which the higher-criticism statistic can be expressed as 
a purely deterministic quantity plus a negligible remainder. Theorem 6.3, 
in section 6.3.3, describes properties of the deterministic “main term” in 
the previous result. Discussion in sections 6.3.3 and 6.4 draws these three 
results together, and shows that they lead to a variety of properties of sig- 
nal detectors and classifiers based on higher criticism. These properties are 
explored numerically in section 6.5. 


6.2. Methodology 


6.2.1. Higher-criticism signal detection 


Assume we observe independent random variables $Z_1, \ldots, Z_p$, where each 
$Z_j$ is normally distributed with mean $\mu_j$ and unit variance. We wish to 
test, or at least to assess the validity of, the null hypothesis $H_0$ that each 
$\mu_j$ equals $\nu_j$, a known quantity, versus the alternative hypothesis that one 
or more of the $\mu_j$ are different from $\nu_j$. If each $\nu_j$ equals zero then this 
context models signal detection problems where the null hypothesis states 
that the signal consists entirely of white noise, and the alternative 
hypothesis indicates that a nondegenerate signal is present. 

A higher-criticism approach to signal detection and hypothesis testing, 
a two-sided version of a suggestion by [8], can be based on the statistic 

$$ \mathrm{hc} = \inf_{u \,:\, \psi(u) \ge C} \psi(u)^{-1/2} \sum_{j=1}^{p} \{ I(|Z_j - \nu_j| \le u) - \Psi(u) \}, \tag{6.1} $$

where $\Psi(u) = 2\,\Phi(u) - 1$ is the distribution function of $|Z_j - \nu_j|$ under $H_0$, 
$\Phi$ is the standard normal distribution function, $\psi(u) = p\,\Psi(u)\{1 - \Psi(u)\}$ 
equals the variance of $\sum_j \{ I(|Z_j - \nu_j| \le u) - \Psi(u) \}$ under $H_0$, and $C$ is a 
positive constant. 

The statistic at (6.1) provides a way of assessing the statistical significance 
of $p$ tests of significance. In particular, $H_0$ is rejected if hc takes too 
large a negative value. This test enjoys optimality properties, in that it is 
able to detect the presence of nonzero values of $\mu_j$ up to levels of sparsity 
and amplitude that are so high and so low, respectively, that no test can 
distinguish between the null and alternative hypotheses ([8]). 
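
The statistic at (6.1) can be explored numerically. The following is a minimal Python sketch, not the authors' implementation: it assumes that the infimum is taken over those $u$ with $\psi(u) \ge C$, and it approximates that infimum on a finite grid. The names `null_cdf`, `hc_statistic` and `simulate`, the grid settings, and the simulated signal (40 components shifted by 4) are all illustrative assumptions.

```python
import math
import random


def null_cdf(u):
    """Psi(u) = 2*Phi(u) - 1, the H0 distribution function of |Z_j - nu_j|."""
    return math.erf(u / math.sqrt(2.0))


def hc_statistic(z, nu, C=1.0, u_max=6.0, grid_size=600):
    """Grid approximation to hc at (6.1); only u with psi(u) >= C are used."""
    p = len(z)
    deviations = sorted(abs(zj - vj) for zj, vj in zip(z, nu))
    best = math.inf  # returned unchanged if no grid point has psi(u) >= C
    count = 0        # number of j with |Z_j - nu_j| <= u
    for i in range(1, grid_size + 1):
        u = u_max * i / grid_size
        while count < p and deviations[count] <= u:
            count += 1
        Psi = null_cdf(u)
        psi = p * Psi * (1.0 - Psi)  # null variance of the centred count
        if psi >= C:
            best = min(best, (count - p * Psi) / math.sqrt(psi))
    return best


def simulate(p=2000, n_signals=40, amplitude=4.0, seed=0):
    """Return (hc for pure noise, hc after adding a sparse signal to that noise)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(p)]
    shifted = [e + (amplitude if j < n_signals else 0.0)
               for j, e in enumerate(noise)]
    zeros = [0.0] * p
    return hc_statistic(noise, zeros), hc_statistic(shifted, zeros)
```

Calling `simulate()` returns the pair of values for one simulated dataset; the second value is expected to be markedly more negative than the first, in line with the rule that $H_0$ is rejected when hc is too negative.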


6.2.2. Generalising and adapting to an unknown null 
distribution 


When employed in the context of hypothesis testing (where the $\nu_j$s are not 
necessarily equal to zero), higher criticism could be used in more general 
settings, where the centered $Z_j$s are not identically distributed. Further, 
instead of assuming that the $\nu_j$s are prespecified, they could be taken equal 
to the $j$th component of the empirical mean of a set of $n_W$ identically 
distributed random $p$-vectors $W_1, \ldots, W_{n_W}$, where $W_i = (W_{i1}, \ldots, W_{ip})$ 
has the distribution of $(Z_1, \ldots, Z_p)$ under the null hypothesis $H_0$. Here, 
$H_0$ asserts the equality of the mean components of the vector $Z$, and of 
the vectors $W_1, \ldots, W_{n_W}$, whose distribution is known except for the mean, 
which is estimated by its empirical counterpart. There, we could redefine hc 
by replacing, in (6.1), $\nu_j$ by $\bar W_{.j} = n_W^{-1} \sum_i W_{ij}$ and $\Psi$ by the distribution 
$\Psi_{Wj}$, say, of $|Z_j - \bar W_{.j}|$ under the null hypothesis. This gives, in place of 
hc at (6.1), 

$$ \mathrm{hc}_W = \inf_{u \in \mathcal{U}_W} \psi_W(u)^{-1/2} \sum_{j=1}^{p} \{ I(|Z_j - \bar W_{.j}| \le u) - \Psi_{Wj}(u) \}, \tag{6.2} $$

where 

$$ \psi_W(u) = \sum_{j=1}^{p} \Psi_{Wj}(u)\,\{1 - \Psi_{Wj}(u)\} \tag{6.3} $$

and, given $C, t > 0$, $\mathcal{U}_W = \mathcal{U}_W(C,t)$ is the set of $u$ for which $u \ge t$ and 
$\psi_W(u) \ge C$. Here $t$ denotes a threshold, and the fact that, in the definition 
of $\mathcal{U}_W$, we confine attention to $u \ge t$, means that we restrict ourselves to 
values of $u$ for which distribution approximations are relatively accurate; 
see section 6.4.2 for details. 

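Further, in practical applications it is often unrealistic to argue that 
$\Psi$ (respectively, $\Psi_{Wj}$) is known exactly, and we should replace $\Psi$ in (6.1) 
and in $\psi$ (respectively, $\Psi_{Wj}$ in (6.2) and in $\psi_W$) by an estimator $\widehat\Psi$ of $\Psi$ 
(respectively, $\widehat\Psi_{Wj}$ of $\Psi_{Wj}$). This leads to an empirical approximation, 
$\widehat{\mathrm{hc}} = \widehat{\mathrm{hc}}(C,t)$, to hc: 

$$ \widehat{\mathrm{hc}} = \inf_{u \in \widehat{\mathcal{U}}} \widehat\psi(u)^{-1/2} \sum_{j=1}^{p} \{ I(|Z_j - \nu_j| \le u) - \widehat\Psi(u) \}, $$

where $\widehat\psi(u) = p\,\widehat\Psi(u)\{1 - \widehat\Psi(u)\}$ and $\widehat{\mathcal{U}} = \widehat{\mathcal{U}}(C,t)$ is the set of $u$ for which 
$u \ge t$ and $\widehat\psi(u) \ge C$, and to an empirical approximation $\widehat{\mathrm{hc}}_W = \widehat{\mathrm{hc}}_W(C,t)$ 
to $\mathrm{hc}_W$: 

$$ \widehat{\mathrm{hc}}_W = \inf_{u \in \widehat{\mathcal{U}}_W} \widehat\psi_W(u)^{-1/2} \sum_{j=1}^{p} \{ I(|Z_j - \bar W_{.j}| \le u) - \widehat\Psi_{Wj}(u) \}, \tag{6.4} $$

where 

$$ \widehat\psi_W(u) = \sum_{j=1}^{p} \widehat\Psi_{Wj}(u)\,\{1 - \widehat\Psi_{Wj}(u)\} \tag{6.5} $$

and $\widehat{\mathcal{U}}_W = \widehat{\mathcal{U}}_W(C,t)$ is the set of $u$ for which $u \ge t$ and $\widehat\psi_W(u) \ge C$. Here $t$ 
denotes a threshold, and the fact that, in the definition of $\widehat{\mathcal{U}}_W$, we confine 
attention to $u \ge t$, means that we restrict ourselves to values of $u$ for which 
distribution approximations are relatively accurate; see section 6.4.2 for 
details. 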

Estimators of $\Psi_{Wj}$ are, broadly speaking, of two types: either they 
depend strongly on the data, or they depend on the data only through the 
way these have been collected, for instance through sample size. In the 
first case, $\widehat\Psi_{Wj}$ might be computed directly from the data, 
for example by pooling vector components for nearby values of the index $j$. 
(In genomic examples, “nearby” does not necessarily mean close in terms 
of position on the chromosome; it is often more effectively defined in other 
ways, for example in the sense of gene pathways.) 

Examples of the second type come from an important class of problems 
where the variables $W_{ij}$ are obtained by averaging other data. For example, 
they can represent Studentised means, $W_{ij} = N_W^{1/2}\, \bar U_{Wij} / S_{Wij}$, or Student 
$t$ statistics for two-sample tests, 
$W_{ij} = N_W^{1/2}\, (\bar U_{Wij,1} - \bar U_{Wij,2}) / (S_{Wij,1}^2 + S_{Wij,2}^2)^{1/2}$, 
where, for $i = 1, \ldots, n_W$ and $j = 1, \ldots, p$, $\bar U_{Wij,k}$ and $S_{Wij,k}^2$, for 
$k = 1, 2$, denote respectively the empirical mean and empirical variance of 
$N_W$ independent and identically distributed data; or statistics computed 
in a related way. See e.g. [24], [27], [13] and [10]. In such cases, if $Z_j$ and 
$W_{1j}, \ldots, W_{n_W j}$ were identically distributed, $Z_j - \bar W_{.j}$ would be approximately 
normally distributed with variance $\tau_W = 1 + n_W^{-1}$, and $\widehat\Psi_{Wj}$ would be the 
distribution function of the normal $N(0, \tau_W)$ distribution, not depending 
on the index $j$, and depending on the data only through the number $n_W$ 
of observations. See section 6.3.3 for theoretical properties for this type of 
data. 

Formula (6.5), giving an empirical approximation to the variance of the 
series on the right-hand side of (6.4), might seem to suggest that, despite 
the increased generality we are capturing by using empirical approximations 
to the distribution functions $\Psi_{Wj}$, we are continuing to assume that 
the vector components $W_{i1}, \ldots, W_{ip}$ are independent. However, independence 
is not essential. By choosing the threshold, $t$, introduced earlier in 
this section, to be sufficiently large, the independence assumption can be 
removed while retaining the validity of the variance approximation at (6.5). 
See section 6.4.2. 


6.2.3. Classifiers based on higher criticism 


The generality of the higher-criticism approximation in section 6.2.2 leads 
directly to higher-criticism methods for classification. To define the classi- 
fication problem, assume we have data, in the form of independent random 
samples of $p$-dimensional vectors $\mathcal{X} = \{X_1, \ldots, X_{n_X}\}$ from population $\Pi_X$, 
and $\mathcal{Y} = \{Y_1, \ldots, Y_{n_Y}\}$ from population $\Pi_Y$, and a new, independent 
observation, $Z$, from either $\Pi_X$ or $\Pi_Y$. (In our theoretical work the sample 
sizes $n_X$ and $n_Y$ will be kept fixed as $p$ increases.) We wish to assign $Z$ to 
one of the populations. In the conventional case where p is small relative 
to sample size, many different techniques have been developed for solving 
this problem. However, in the setting in which we are interested, where p 
is large and the sample size is small, these methods can be ineffective, and 
better classification algorithms can be obtained by using methods particu- 
larly adapted to the detection of sparse signals. 

Let $X_{ij}$, $Y_{ij}$ and $Z_j$ denote the $j$th components of $X_i$, $Y_i$ and $Z$, 
respectively. Assume that $\mu_{Xj} = E(X_{ij})$ and $\mu_{Yj} = E(Y_{ij})$ do not depend 
on $i$, that the distributions of the components are absolutely continuous, 
and that the distributions of the vectors $(X_{i_1 1} - \mu_{X1}, \ldots, X_{i_1 p} - \mu_{Xp})$, 
$(Y_{i_2 1} - \mu_{Y1}, \ldots, Y_{i_2 p} - \mu_{Yp})$ and $(Z_1 - E(Z_1), \ldots, Z_p - E(Z_p))$ are identical 
to one another, for $1 \le i_1 \le n_X$ and $1 \le i_2 \le n_Y$. 

In particular, for each $i_1$, $i_2$ and $j$ the distributions of $X_{i_1 j}$ and $Y_{i_2 j}$ 
differ only in location. This assumption serves to motivate methodology, and 
is a convenient platform for theoretical arguments. Of course, many other 
settings can be addressed, but they are arguably best treated using their 
intrinsic features. Instances of particular interest include those where each 
component distribution is similar to a Studentised mean. A particular rep- 
resentation of this type, involving only location changes, will be discussed 
extensively in section 6.3.3. Other variants, where non-zero location also 
entails changes in shape, can be treated using similar arguments, provided 
the shape-changes can be parametrised. 

With $W$ denoting $X$ or $Y$ we shall write $V_{Wj}$ for a random variable 
having the distribution of $Z_j - \bar W_{.j}$, given that $Z$ is drawn from $\Pi_W$. If 
$n_X = n_Y$ then the distribution of $V_{Wj}$ depends only on $j$, not on choice 
of $W$. Let $\bar X_{.j} = n_X^{-1} \sum_i X_{ij}$ and define $\bar Y_{.j}$ analogously. Let $\Psi_{Wj}$ 
be the distribution function of $|V_{Wj}|$, and put 
$\Delta_{Wj}(u) = I(|Z_j - \bar W_{.j}| \le u) - \Psi_{Wj}(u)$. 

If $Z$ is from $\Pi_W$ then, for each $j$, $|Z_j - \bar W_{.j}|$ has distribution function 
$\Psi_{Wj}$, and so, for each fixed $u$, $\Delta_{Wj}(u)$ has expected value zero. On the 
other hand, since the distributions of $X_{ij}$ and $Y_{ij}$ may differ in location, 
then, if $Z$ is not from $\Pi_W$, $P(|Z_j - \bar W_{.j}| \le u)$ may take a lesser value 
than it does when $Z$ is from $\Pi_W$, with the result that the expected value of 
$\Delta_{Wj}(u)$ can be strictly negative. Provided an estimator of $\Psi_{Wj}$ is available 
for $W = X$ and $W = Y$, this property can be used to motivate a classifier. 
In particular, defining $\widehat{\mathrm{hc}}_X$ and $\widehat{\mathrm{hc}}_Y$ as at (6.4), we should classify $Z$ as 
coming from $\Pi_X$ if $\widehat{\mathrm{hc}}_X \ge \widehat{\mathrm{hc}}_Y$, and as coming from $\Pi_Y$ otherwise. 
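
The classification rule just described can be sketched in Python. This is an illustration under stated assumptions rather than the authors' code: each estimator of the distribution of $|Z_j - \bar W_{.j}|$ is taken to be that of the absolute value of a normal variable with variance $1 + 1/n_W$ (the normal approximation mentioned in section 6.2.2), the infimum at (6.4) is approximated on a grid, and the function names, grid settings and default values of `C` and `t` are hypothetical.

```python
import math
import random


def hc_hat(z, w_bar, n_w, C=1.0, t=1.0, u_max=8.0, grid_size=350):
    """Grid approximation to (6.4) with a N(0, 1 + 1/n_w) null approximation."""
    p = len(z)
    tau = 1.0 + 1.0 / n_w  # approximate variance of Z_j minus the sample mean
    deviations = sorted(abs(zj - wj) for zj, wj in zip(z, w_bar))
    best = math.inf
    count = 0  # number of j with |Z_j - Wbar_j| <= u
    for i in range(grid_size + 1):
        u = t + (u_max - t) * i / grid_size  # only u >= t is considered
        while count < p and deviations[count] <= u:
            count += 1
        Psi = math.erf(u / math.sqrt(2.0 * tau))  # P(|N(0, tau)| <= u)
        psi = p * Psi * (1.0 - Psi)
        if psi >= C:
            best = min(best, (count - p * Psi) / math.sqrt(psi))
    return best


def column_means(sample):
    """Componentwise mean of a list of equal-length vectors."""
    n = len(sample)
    return [sum(col) / n for col in zip(*sample)]


def classify(z, x_sample, y_sample, **kwargs):
    """Assign z to 'X' if the X-based statistic is at least the Y-based one."""
    hc_x = hc_hat(z, column_means(x_sample), len(x_sample), **kwargs)
    hc_y = hc_hat(z, column_means(y_sample), len(y_sample), **kwargs)
    return "X" if hc_x >= hc_y else "Y"
```

When the two populations differ in a sparse set of components, the statistic computed against the wrong sample tends to be strongly negative, so the comparison of the two values identifies the population from which the new observation is more plausibly drawn.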


6.3. Theoretical Properties 


6.3.1. Effectiveness of approximation to $\mathrm{hc}_W$ by $\widehat{\mathrm{hc}}_W$ 


We start by studying the effectiveness of the approximation by $\widehat{\mathrm{hc}}_W$ to 
$\mathrm{hc}_W$, where $\mathrm{hc}_W$ and $\widehat{\mathrm{hc}}_W$ are defined as at (6.2) and (6.4), respectively 
(arguments similar to those given here can be used to demonstrate the 
effectiveness of the approximation by $\widehat{\mathrm{hc}}$ to hc). To embed the case of 
hypothesis testing in that of classification, we express the problem of 
hypothesis testing as one where the vector $Z$ comes from a population $\Pi_Z$, 
equal to either $\Pi_X$ or $\Pi_Y$, and where the data vectors $W_1, \ldots, W_{n_W}$ come 
from $\Pi_W$, with $\Pi_W = \Pi_Z$ under $H_0$, $\Pi_W = \Pi_T$ otherwise, and $(\Pi_Z, \Pi_T)$ 
denoting one of $(\Pi_X, \Pi_Y)$ or $(\Pi_Y, \Pi_X)$. We assume throughout section 6.3 
that each $\widehat\Psi_{Wj}$ is, with probability 1, a continuous distribution function 
satisfying $\widehat\Psi_{Wj}(x) \to 0$ as $x \downarrow 0$ and $\widehat\Psi_{Wj}(x) \to 1$ as $x \uparrow \infty$. We also make 
the following additional assumptions. 


Condition A 


(A1) The threshold, $t = t(p)$, varies with $p$ in such a manner that, for each 
$C > 0$ and for $W = X$ and $Y$, 
$\sup_{u \in \mathcal{U}_W(C,t)} \psi_W(u)^{-1/2} \sum_{j=1}^{p} |\widehat\Psi_{Wj}(u) - \Psi_{Wj}(u)| = o_p(1)$. 

(A2) For a constant $u_0 > 0$, for each of the two choices of $W$, and for all 
sufficiently large $p$, $\psi_W$ is nonincreasing and strictly positive on $[u_0, \infty)$; 
and the probability that $\widehat\psi_W$ is nonincreasing on $[u_0, \infty)$ converges to 1 as 
$p \to \infty$. 


The reasonableness of (A1) is taken up in section 6.4.2, below. The first 
part of (A2) says merely that $\psi_W$ inherits the monotonicity properties of 
its component parts, $\Psi_{Wj}(1 - \Psi_{Wj})$. Indeed, if $\Psi_{Wj}$ is the distribution 
function of a distribution that has unbounded support, then $\Psi_{Wj}(1 - \Psi_{Wj})$ 
is nonincreasing and strictly positive on $[u_0, \infty)$ for some $u_0 > 0$, and (A2) 
asks that the same be true of $\psi_W = \sum_j \Psi_{Wj}(1 - \Psi_{Wj})$. This is of course 
trivial if the distributions $\Psi_{Wj}$ are identical. The second part of (A2) 
states that the same is true of the estimator, $\widehat\psi_W$, of $\psi_W$, which condition 
is satisfied if, for example, the observations represent Studentised means. 

The next theorem shows that, under sufficient conditions, $\widehat{\mathrm{hc}}_W$ is an 
effective approximation to $\mathrm{hc}_W$. Note that we make no assumptions about 
independence of vector components, or about the population from which $Z$ 
comes. In particular, the theorem is valid for data drawn under both the 
null and alternative hypotheses. 


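Theorem 6.1. Let $0 < C_1 < C_2 < C_3 < \infty$ and $0 < C < C_3$, and assume 
that $\psi_W(t) \ge C_3$ for all sufficiently large $p$. If (A1) and (A2) hold then, 
with $W = X$ or $Y$, 

$$ \widehat{\mathrm{hc}}_W(C,t) = \{1 + o_p(1)\} \inf_{u \in \widehat{\mathcal{U}}_W(C,t)} \psi_W(u)^{-1/2} \sum_{j=1}^{p} \{ I(|Z_j - \bar W_{.j}| \le u) - \Psi_{Wj}(u) \} + o_p(1), \tag{6.6} $$

$$ \mathrm{hc}_W(C_1,t) + o_p(1) \le \{1 + o_p(1)\}\, \widehat{\mathrm{hc}}_W(C_2,t) + o_p(1) \le \{1 + o_p(1)\}\, \mathrm{hc}_W(C_3,t) + o_p(1). \tag{6.7} $$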


We shall see in the next section that, in many cases of interest, when $Z$ 
is not drawn from $\Pi_W$, the higher-criticism statistic $\mathrm{hc}_W$ tends, with high 
probability, to be negative and does not converge to zero as $p \to \infty$. Our 
results in the next section also imply that, when $Z$ comes from $\Pi_W$, the 
last, added remainders $o_p(1)$ on the far right-hand sides of (6.6) and (6.7) 
are of smaller order than the earlier quantities on the right. Together, these 
properties justify approximating $\mathrm{hc}_W$ by $\widehat{\mathrm{hc}}_W$. 


6.3.2. Removing the assumption of independence 


We now study the properties of higher-criticism statistics in cases where 
the components are not independent. To illustrate the type of dependence 
that we have in mind, let us consider the case where $Z$ is drawn from $\Pi_W$, 
and the variables $V_j = Z_j - \bar W_{.j}$ form a mixing sequence with exponentially 
rapidly decreasing mixing coefficients. The case where the mixing 
coefficients decrease only polynomially fast, as functions of p, can also be 
treated; see Remark 6.3. 

To give a specific example, note that the cases of moving-average pro- 
cesses or autoregressions, of arbitrary (including infinite) order, fall nat- 
urally into the setting of exponentially fast mixing. Indeed, assume for 
simplicity that the variables V; form a stochastic process, not necessarily 
stationary, that is representable as 


$$ V_j = \sum_{k=1}^{\infty} a_{jk}\, \xi_{j-k}, $$

where the $a_{jk}$s are constants satisfying $|a_{jk}| \le \text{const.}\ \rho^k$ for all $j$ and $k$, 
$0 < \rho < 1$, and the disturbances $\xi_j$ are independent with zero means and 
uniformly bounded variances. Given $c \ge 1$, let $\ell$ denote the integer part of 
$c \log p$, and put 

$$ V_j' = \sum_{k=1}^{\ell} a_{jk}\, \xi_{j-k}. $$

Then, by Markov’s inequality, 

$$ P(|V_j - V_j'| > u) \le u^{-2}\, E\Big( \sum_{k=\ell+1}^{\infty} a_{jk}\, \xi_{j-k} \Big)^2 = O(u^{-2} \rho^{2\ell}), $$

uniformly in $u > 0$, $c \ge 1$ and integers $j$. By taking $u = p^{-C}$ for $C > 0$ 
arbitrarily large, and then choosing $c \ge 3\,C\,|\log \rho|^{-1}$, we deduce that the 
approximants $V_j'$ have the following two properties: (a) $P(|V_j - V_j'| \le 
p^{-C}) = 1 - O(p^{-C})$, uniformly in $1 \le j \le p$; and (b) for each $r$ in the 
range $2 \le r \le p$, and each sequence $1 \le j_1 < \ldots < j_r \le p$ satisfying 
$j_{k+1} - j_k \ge c \log p + 1$ for $1 \le k \le r - 1$, the variables $V_{j_k}'$, for $1 \le k \le r$, 
are stochastically independent. 

The regularity condition (B1), below, captures this behaviour in greater 
generality. There, we let $V_{Wj}$, for $1 \le j \le p$, have the joint distribution of 
the respective values of $Z_j - \bar W_{.j}$ when $Z$ is drawn from $\Pi_W$. At (B2), we 
also impose (a) a uniform Hölder smoothness condition on the respective 
distribution functions $\chi_{Wj}$ of $V_{Wj}$, (b) a symmetry condition on $\chi_{Wj}$, and 
(c) a restriction which prevents the upper tail of $\Psi_{Wj}$, for each $j$ and $W$, 
from being pathologically heavy. 


Condition B 


(B1) For each $C, \epsilon > 0$, and each of the two choices of $W$, there exists 
a sequence of random variables $V_{Wj}'$, for $1 \le j \le p$, with the properties: 
(a) $P(|V_{Wj} - V_{Wj}'| \le p^{-C}) = 1 - O(p^{-C})$, uniformly in $1 \le j \le p$; and 
(b) for all sufficiently large $p$, for each $r$ in the range $2 \le r \le p$, and each 
sequence $1 \le j_1 < \ldots < j_r \le p$ satisfying $j_{k+1} - j_k \ge p^\epsilon$ for $1 \le k \le r - 1$, 
the variables $V_{Wj_k}'$, for $1 \le k \le r$, are stochastically independent. 

(B2) (a) For each of the two choices of $W$ there exist constants $C_1, C_2 > 0$, 
the former small and the latter large, such that $|\chi_{Wj}(u_1) - \chi_{Wj}(u_2)| \le 
C_2\, |u_1 - u_2|^{C_1}$, uniformly in $u_1, u_2 > 0$, $1 \le j \le p < \infty$ and $W = X$ or $Y$; 
(b) the function $G_{Wj}(u,v) = P(|V_{Wj} + v| \le u)$ is nonincreasing in $|v|$ for 
each fixed $u$, each choice of $W$ and each $j$; and (c) $\max_{1 \le j \le p} \{1 - \Psi_{Wj}(u)\} = 
O(u^{-\epsilon})$, for $W = X, Y$ and for some $\epsilon > 0$. 

Part (b) of (B2) holds if each distribution of $V_{Wj}$ is symmetric and 
unimodal. 

As explained in the previous section, in both the hypothesis testing and 
classification problems we can consider that $W = X$ or $Y$, indicating the 
population from which we draw the sample against which we check the new 
data value $Z$. Let $\mu_Z = \mu_X$ if $Z$ is from $\Pi_X$, and $\mu_Z = \mu_Y$ otherwise, and 
define $\nu_{WZj} = \mu_{Zj} - \mu_{Wj}$ and 

$$ \mathrm{hc}_{WZ}(C,t) = \sup_{u \in \mathcal{U}_W(C,t)} \psi_W(u)^{-1/2} \sum_{j=1}^{p} \{ P(|V_{Wj}| \le u) - P(|V_{Wj} + \nu_{WZj}| \le u) \}. \tag{6.8} $$

In view of (B2)(b), the quantity within braces in the definition of $\mathrm{hc}_{WZ}$ 
is nonnegative, and so $\mathrm{hc}_{WZ} \ge 0$. Theorem 6.2 below describes the extent 
to which the statistic $\mathrm{hc}_W$, a random variable, can be approximated by the 
deterministic quantity $\mathrm{hc}_{WZ}$. 
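
Because the quantity at (6.8) is deterministic, it can be evaluated directly once a model for $V_{Wj}$ is chosen. The sketch below is an illustration only: it assumes that every $V_{Wj}$ is normal with mean zero and standard deviation `sigma`, that exactly `q` of the shifts equal `nu` while the rest are zero, and that the supremum may be approximated on a grid. The function names and the particular values of `w` and `beta` in `sparse_example` are hypothetical choices.

```python
import math


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def band_prob(u, shift, sigma):
    """P(|V + shift| <= u) for V ~ N(0, sigma^2)."""
    return Phi((u - shift) / sigma) - Phi((-u - shift) / sigma)


def hc_wz(p, q, nu, sigma=1.0, C=1.0, t=0.0, u_max=8.0, grid_size=1600):
    """Grid approximation to (6.8) when q of the p shifts equal nu.

    Returns 0.0 if no grid point satisfies u >= t and psi_W(u) >= C.
    """
    best = 0.0  # each summand is nonnegative under (B2)(b)
    for i in range(grid_size + 1):
        u = t + (u_max - t) * i / grid_size
        Psi = band_prob(u, 0.0, sigma)
        psi = p * Psi * (1.0 - Psi)
        if psi >= C:
            gap = q * (Psi - band_prob(u, nu, sigma))  # unshifted terms vanish
            best = max(best, gap / math.sqrt(psi))
    return best


def sparse_example(p, w=0.5, beta=0.6):
    """hc_WZ for q = p^(1 - beta) shifts of size sqrt(2 w log p)."""
    return hc_wz(p, p ** (1.0 - beta), math.sqrt(2.0 * w * math.log(p)))
```

With these illustrative settings the value returned by `sparse_example` increases with `p`, whereas `hc_wz` is exactly zero when `nu` is zero, matching the observation that the braces in (6.8) vanish when no shift is present.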


Theorem 6.2. Let $C > 0$ be fixed, and take the threshold, $t = t(p)$, to 
satisfy $t \ge 0$ and $\psi_W(t) \ge C$, thus ensuring that $\mathcal{U}_W(C,t)$ is nonempty. 
Let $\mathrm{hc}_W$ and $\mathrm{hc}_{WZ}$ denote $\mathrm{hc}_W(C,t)$ and $\mathrm{hc}_{WZ}(C,t)$, respectively. If (B1) 
and (B2) hold then for each $\epsilon > 0$, 

$$ \mathrm{hc}_W = -\{1 + o_p(1)\}\, \mathrm{hc}_{WZ} + O_p(p^\epsilon). \tag{6.9} $$


An attractive feature of (6.9) is that it separates the “stochastic” and 
“deterministic” effects of the higher-criticism statistic $\mathrm{hc}_W$. The stochastic 
effects go into the term $O_p(p^\epsilon)$. The deterministic effects are represented 
by $\mathrm{hc}_{WZ}$. When the data value $Z$ is from the same population $\Pi_W$ as the 
dataset with which it is compared, each $\nu_{WZj} = 0$ and so, by (6.8), $\mathrm{hc}_{WZ} = 0$. 
Property (6.9) therefore implies that, when $Z$ is from $\Pi_W$, $\mathrm{hc}_W = O_p(p^\epsilon)$ 
for each $\epsilon > 0$. In other cases, where $Z$ is drawn from a population different 
from that from which come the data with which $Z$ is compared, $\mathrm{hc}_{WZ}$ is 
generally nonzero. In such instances the properties of $\mathrm{hc}_W$ can be computed 
by relatively straightforward, deterministic calculations based on $\mathrm{hc}_{WZ}$. In 
particular, when $W \ne Z$, if $\mathrm{hc}_{WZ}$ is of order larger than $p^\epsilon$ for some 
$\epsilon > 0$ (see (6.13) below), then it follows directly that the probability of 
rejecting the null hypothesis, in the hypothesis testing problem, or of correct 
classification, in the classification problem, converges to 1. See, for example, 
section 6.3.3. 


Remark 6.1. Sharpening the term $O_p(p^\epsilon)$ in (6.9). If, as in the problem 
treated by [8], the distribution functions $\Psi_{Wj}$ are all identical and the 
variables $X_{i_1 j_1}$ and $Y_{i_2 j_2}$, for $1 \le i_1 \le n_X$, $1 \le i_2 \le n_Y$ and $1 \le j_1, j_2 \le p$, 
are completely independent, then a refinement of the argument leading to 
(6.9) shows that the $O_p(p^\epsilon)$ term there can be reduced to $O_p(\sqrt{\log p})$. Here 
it is not necessary to assume that the common distributions are normal. 
Indeed, in that context [8] noted that the $O_p(p^\epsilon)$ term in (6.9) can be 
replaced by $O_p(\sqrt{\log\log p})$. 


Remark 6.2. Relaxing the monotonicity condition (B2)(b). Assumption 
(B2)(b) asks that $G_{Wj}(u,v) = P(|V_{Wj} + v| \le u)$ be nonincreasing in $|v|$ for 
each $u$. However, if the distributions of $X_{ij}$ and $Y_{ij}$ are identical for all but 
at most $q$ values of $j$ then it is sufficient to ask that, for these particular $j$, 
it be possible to write, for each $\epsilon > 0$, 

$$ G_{Wj}(u,v) = H_{Wj}(u,v) + o\{ p^{\epsilon - 1}\, \psi_W(u)^{1/2} \}, $$

uniformly in $1 \le j \le p$, $u \ge t$ and $W = X$ and $Y$, where each $H_{Wj}$ has the 
monotonicity property asked of $G_{Wj}$ in (B2)(b). 


Remark 6.3. Mixing at polynomial rate. The exponential-like mixing rate 
implied by (B1) is a consequence of the fact that (a) and (b) in (B1) hold 
for each $C, \epsilon > 0$. If, instead, those properties apply only for a particular 
positive pair $C, \epsilon$, then (6.9) continues to hold with $p^\epsilon$ there replaced by 
$p^\eta$, where $\eta > 0$ depends on $C, \epsilon$ from (B1), and decreases to zero as $C$ 
increases and $\epsilon$ decreases. 


6.3.3. Delineating good performance 


Theorem 6.2 gives a simple representation of the higher-criticism statistic. 
It implies that, if $Z$ is drawn from $\Pi_Q$ where $(W,Q) = (X,Y)$ or $(Y,X)$, 
and if $\mathrm{hc}_{WZ}$ exceeds a constant multiple of $p^\epsilon$ for some $\epsilon > 0$, then the 
probability that we make the correct decision, in either a hypothesis testing 
or a classification problem, converges to 1 as $p \to \infty$. We shall use this 
result to determine a region where hypothesis testing, or classification, is 
possible. For simplicity, in this section we shall assume that each $\mu_{Xj} = 0$, 
and $\mu_{Yj} = 0$ for all but $q$ values of $j$, for which $\mu_{Yj} = \nu > 0$ and $\nu = \nu(p)$ 
diverges with $p$ and does not depend on $j$. The explicit form of $\mathrm{hc}_{WZ}$, 
at (6.8), makes it possible to handle many other settings, but a zero-or-$\nu$ 
representation of each mean difference permits an insightful comparison 
with results discussed by [8]. 

In principle, two cases are of interest, where the tails of the distribution 
of $V_{Wj}$ decrease polynomially or exponentially fast, respectively. However, 
in the polynomial case it can be proved using (6.8) that the hypothesis testing 
and classification problems are relatively simple. Therefore, we study 
only the exponential setting. In this context, and reflecting the discussion 
in section 6.2.2, we take the distribution of $V_{Wj}$ to be that of the difference 
between two Studentised means, standardised by dividing by $\sqrt{2}$, and the 
distributions of $X_{ij}$ and $Y_{ij}$ to represent translations of that distribution. 
See (C1) and (C2) below. Alternatively we could work with the case where 
$X_{ij}$ is a Studentised mean for a distribution with zero mean, and $Y_{ij}$ is 
computed similarly but for the case where the expected value is shifted 
by $\pm\nu$. Theoretical arguments in the latter setting are almost identical to 
those given here, being based on results of [32]. 


Condition C 


(C1) For each pair $(W,j)$, where $W = X$ or $Y$ and $1 \le j \le p$, let $U_{Wjk}$, 
for $1 \le k \le N_{Wj}$, denote random variables that are independent and 
identically distributed as $U_{Wj}$, where $E(U_{Wj}) = 0$, $E(U_{Wj}^4)$ is bounded 
uniformly in $(W,j)$, $E(U_{Wj}^2)$ is bounded away from zero uniformly in $(W,j)$, 
and $N_{Wj} \ge 2$. Let $V_{Wj}$ have the distribution of $2^{-1/2}$ times the difference 
between two independent copies of $N_{Wj}^{1/2}\, \bar U_{Wj} / S_{Wj}$, where $\bar U_{Wj}$ and $S_{Wj}^2$ 
denote respectively the empirical mean and variance of the data $U_{Wj1}, \ldots, 
U_{WjN_{Wj}}$. Take $\mu_{Xj} = 0$ for each $j$, $\mu_{Yj} = 0$ for all but $q = q(p)$ values of 
$j$, say $j_1, \ldots, j_q$, and $|\mu_{Yj}| = \nu$ for these particular values of $j$. 

(C2) The quantity $\nu$ in (C1) is given by $\nu = \sqrt{2 w \log p}$, and the threshold, 
$t$, satisfies $B \le t \le \sqrt{2 s \log p}$ for some $B, s > 0$, where $0 < w < 1$ and 
$0 < s < \min(4w, 1)$. 

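The setting described by (C1) is one where a working statistician would, 
in practice, generally take each distribution approximation $\widehat\Psi_{Wj}(u)$ to be 
simply $P(|\xi| \le u)$, where $\xi$ has the standard normal distribution. The 
signal detection boundary in this setting is obtained using a polynomial 
model for the number of added shifts: 

$$ \text{for some } \tfrac{1}{2} < \beta < 1, \qquad q \sim \text{const.}\ p^{1-\beta} \tag{6.10} $$

(see [8]). The boundary is then determined by: 

$$ w > \begin{cases} \beta - \tfrac{1}{2} & \text{if } \tfrac{1}{2} < \beta \le \tfrac{3}{4}, \\ (1 - \sqrt{1 - \beta})^2 & \text{if } \tfrac{3}{4} < \beta < 1. \end{cases} \tag{6.11} $$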


The inequality (6.11) is also sufficient in hypothesis testing and classification 
problems where the data are exactly normally distributed. Likewise 
it is valid if we use a normal approximation and if that approximation is 
good enough. The question we shall address is, “how good is good enough?” 
The following theorem answers this question in cases where $N_{Wj}$ diverges 
at least at a logarithmic rate, as a function of $p$. The proof is given in 
Appendix A.2. 


Theorem 6.3. Assume (C1), (C2), (6.10), that $w$ satisfies (6.11), and 
that, for $W = X$ or $Y$ and $1 \le j \le p$, $N_{Wj}$, given in (C1), satisfies 

$$ N_{Wj}^{-1} (\log p)^4 \to 0. \tag{6.12} $$

Suppose too that $Z$ is from $\Pi_Q$, where $(W,Q) = (X,Y)$ or $(Y,X)$. Then, 
for constants $B, \eta > 0$, 

$$ \mathrm{hc}_{WZ} \ge B\, p^\eta. \tag{6.13} $$


Condition (6.12) confirms that the samples on which the coordinate 
data are based need be only logarithmically large, as a function of p, in 
order for the higher-criticism classifier to be able to detect the difference 
between the W and Q populations. 


6.4. Further Results 


6.4.1. Alternative constructions of $\mathrm{hc}_W$ and $\widehat{\mathrm{hc}}_W$ 


There are several other ways of constructing higher-criticism statistics when 
the distribution functions $\Psi_{Wj}$ depend on $j$ and have to be estimated. For 
example, omitting for simplicity the threshold $t$, we could re-define $\widehat{\mathrm{hc}}_W$ as: 

$$ \widehat{\mathrm{hc}}_W = \inf_{u}\, \{p\, u\, (1 - u)\}^{-1/2} \sum_{j=1}^{p} \big[ I\{ |Z_j - \bar W_{.j}| \le \widehat\Psi_{Wj}^{-1}(u) \} - u \big]. \tag{6.14} $$

If $Z$ were drawn from $\Pi_W$ then the random variable $K = \sum_j I\{ |Z_j - 
\bar W_{.j}| \le \Psi_{Wj}^{-1}(u) \}$ would have exactly a binomial $\mathrm{Bi}(p, u)$ distribution. The 
normalisation in formula (6.14) for $\widehat{\mathrm{hc}}_W$ reflects this property. However, 
replacing $K$ by $\widehat K = \sum_j I\{ |Z_j - \bar W_{.j}| \le \widehat\Psi_{Wj}^{-1}(u) \}$, as in (6.14), destroys 
the independence of the summands, and makes normalisation problematic. 
This is particularly true when, as would commonly be the case in practice, 
the estimators $\widehat\Psi_{Wj}$ are computed from data $W_{ij_1}$, for values of $j_1$ that are 
local to $j$. In such cases the estimators $\widehat\Psi_{Wj}$ would not be root-$p$ consistent 
for the respective distributions $\Psi_{Wj}$. 

If the distribution of $|Z_j - \bar W_{.j}|$ were known up to its standard deviation, 
$\sigma_{Wj}$, and if we had an estimator, $\widehat\sigma_{Wj}$, of $\sigma_{Wj}$ for each $W$ and $j$, then we 
could construct a third version of $\widehat{\mathrm{hc}}_W$: 

$$ \widehat{\mathrm{hc}}_W = \inf_{u}\, \phi_W(u)^{-1/2} \sum_{j=1}^{p} \{ I(|Z_j - \bar W_{.j}| / \widehat\sigma_{Wj} \le u) - \Phi_{Wj}(u) \}, $$

where $\Phi_{Wj}$ denotes the distribution function of $|Z_j - \bar W_{.j}| / \sigma_{Wj}$ under the 
assumption that $Z$ is drawn from $\Pi_W$, and $\phi_W(u) = \sum_j \Phi_{Wj}\,(1 - \Phi_{Wj})$. 
Again, however, the correlation induced through estimation, this time the 
estimation of $\sigma_{Wj}$, makes the normalisation difficult to justify. 

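In some problems there is good reason to believe that if the marginal 
means of the populations $\Pi_X$ and $\Pi_Y$ differ, then the differences are of a 
particular sign. For example, it might be known that $\mu_{Xj} \ge \mu_{Yj}$ for all $j$. 
In this case we would alter the construction of the higher-criticism statistics 
$\mathrm{hc}_W$ and $\widehat{\mathrm{hc}}_W$, at (6.2) and (6.4), to: 

$$ \mathrm{hc}_W^{\mathrm{os}} = \inf_{u \in \mathcal{U}_W^{\mathrm{os}}} \psi_W^{\mathrm{os}}(u)^{-1/2} \sum_{j=1}^{p} \{ I(Z_j - \bar W_{.j} \le u) - \Psi_{Wj}^{\mathrm{os}}(u) \}, \tag{6.15} $$

$$ \widehat{\mathrm{hc}}_W^{\mathrm{os}} = \inf_{u \in \widehat{\mathcal{U}}_W^{\mathrm{os}}} \widehat\psi_W^{\mathrm{os}}(u)^{-1/2} \sum_{j=1}^{p} \{ I(Z_j - \bar W_{.j} \le u) - \widehat\Psi_{Wj}^{\mathrm{os}}(u) \}, \tag{6.16} $$

respectively, where 

$$ \psi_W^{\mathrm{os}}(u) = \sum_{j=1}^{p} \Psi_{Wj}^{\mathrm{os}}(u)\, \{1 - \Psi_{Wj}^{\mathrm{os}}(u)\}, \qquad \widehat\psi_W^{\mathrm{os}}(u) = \sum_{j=1}^{p} \widehat\Psi_{Wj}^{\mathrm{os}}(u)\, \{1 - \widehat\Psi_{Wj}^{\mathrm{os}}(u)\}, $$

$\widehat\Psi_{Wj}^{\mathrm{os}}(u)$ is an empirical approximation to the probability $\Psi_{Wj}^{\mathrm{os}}(u) = P(Z_j - 
\bar W_{.j} \le u)$ when $Z$ is drawn from $\Pi_W$, $\mathcal{U}_W^{\mathrm{os}} = \mathcal{U}_W^{\mathrm{os}}(C,t)$ is the set of $u$ for 
which $u \ge t$ and $\psi_W^{\mathrm{os}}(u) \ge C$, $\widehat{\mathcal{U}}_W^{\mathrm{os}}$ is defined analogously, and the superscript 
“os” denotes “one-sided.” When using $\widehat{\mathrm{hc}}_W^{\mathrm{os}}$, we would classify $Z$ as coming 
from $\Pi_X$ if $\widehat{\mathrm{hc}}_X^{\mathrm{os}} \ge \widehat{\mathrm{hc}}_Y^{\mathrm{os}}$, and as coming from $\Pi_Y$ otherwise. 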


Remark 6.4. Adapting Theorems 6.1 and 6.2. Theorems 6.1 and 6.2 
have direct analogues, formulated in the obvious manner, for the one-sided 
classifiers $\mathrm{hc}_W^{\mathrm{os}}$ and $\widehat{\mathrm{hc}}_W^{\mathrm{os}}$, introduced above. In particular, the one-sided 
version of $\mathrm{hc}_{WZ}$, at (6.8), is obtained by removing the absolute value signs 
there. The regularity conditions too differ in only minor respects. For 
example, when formulating the appropriate version of (A1) we replace $\Psi_{Wj}$, 
$\widehat\Psi_{Wj}$, $\psi_W$ and $\widehat\psi_W$ by $\Psi_{Wj}^{\mathrm{os}}$, $\widehat\Psi_{Wj}^{\mathrm{os}}$, $\psi_W^{\mathrm{os}}$ and $\widehat\psi_W^{\mathrm{os}}$, respectively. Part (b) of 
(B2) can be dropped on this occasion, since its analogue in the one-sided 
case follows directly from the monotonicity of a distribution function. 


6.4.2. Advantages of incorporating the threshold 


By taking the threshold, $t$, large we can construct the higher-criticism statistics 
$\mathrm{hc}_W$ and $\widehat{\mathrm{hc}}_W$, at (6.2) and (6.4), so that they emphasise relatively large 
values of $|Z_j - \bar W_{.j}|$. This is potentially advantageous, especially when working 
with $\widehat{\mathrm{hc}}_W$, since we expect the value of $u$ at which the infimum at (6.4) 
is achieved also to be large. 

The most important reasons for thresholding are more subtle than this 
argument would suggest, however. They are founded on properties of relative 
errors in distribution approximations, and on the fact that the divisor 
in (6.2) is $\psi_W^{1/2}$, not simply $\psi_W$. To see why this is significant, consider the 
case where the distribution functions $\Psi_{Wj}$ are all identical, to $\Psi$ say. Then 
$\psi_W = p\, \Psi\, (1 - \Psi)$, which we estimate by $\widehat\psi_W = p\, \widehat\Psi\, (1 - \widehat\Psi)$, say. In order 
for the effect of replacing each $\Psi_{Wj}(u)$ (appearing in (6.2)) by $\widehat\Psi_{Wj}(u)$ (in 
(6.4)) to be asymptotically negligible, we require the quantity 

$$ p\, \psi_W(u)^{-1/2}\, |\widehat\Psi(u) - \Psi(u)| $$

to be small. Equivalently, if $u$ is in the upper tail of the distribution $\Psi$, we 
need the ratio 

$$ \frac{p^{1/2}\, |\widehat\Psi(u) - \Psi(u)|}{\{1 - \Psi(u)\}^{1/2}} \tag{6.17} $$

to be small. 

If the approximation of $\Psi$ by $\widehat\Psi$ (or more particularly, of $1 - \Psi$ by $1 - \widehat\Psi$) 
is accurate in a relative sense, as it is (for example) if $\Psi$ is the distribution 
of a Studentised mean, then, for large $u$, 

$$ \rho(u) = \frac{|\widehat\Psi(u) - \Psi(u)|}{1 - \Psi(u)} \tag{6.18} $$

is small for $u$ in the upper tail as well as for $u$ in the middle of the distribution. 
When $u$ is in the upper tail, so that $1 - \Psi(u)$ is small, then, comparing 
(6.17) and (6.18), we see that we do not require $\rho(u)$ to be as small as it 
would have to be in the middle of the distribution. By insisting that $u \ge t$, 
where the threshold $t$ is relatively large, we force $u$ to be in the upper tail, 
thus obtaining the advantage mentioned in the previous sentence. 

Below, we show in more detail why, if thresholding is not undertaken,
that is, if we do not choose t large when applying the higher-criticism
classifier, substantial errors can occur when using the classifier. They arise
through an accumulation of errors in the approximation $\hat\Psi_{Wj} \approx \Psi_{Wj}$.

Commonly, the approximation of $\Psi_{Wj}$ by $\hat\Psi_{Wj}$ can be expressed as

$$\hat\Psi_{Wj}(u) = \Psi_{Wj}(u) + \delta_p\,\alpha_{Wj}(u) + o(\delta_p), \tag{6.19}$$

where $\delta_p$ decreases to zero as p increases and represents the accuracy of the
approximation; $\alpha_{Wj}$ is a function, which may not depend on j; and the
remainder, $o(\delta_p)$, denotes higher-order terms. Even if $\alpha_{Wj}$ depends on j,
its contribution cannot be expected to “average out” of $\widehat{hc}_W$, by some sort
of law-of-large-numbers effect, as we sum over j.

In some problems the size of $\delta_p$ is determined by the number of data
used to construct $\hat\Psi_{Wj}$. For example, in the analysis of gene-expression
data, $\hat\Psi_{Wj}$ might be calculated by borrowing information from neighbouring
values of j. In order for this method to be adaptive, only a small proportion
of genes would be defined as neighbours for any particular j, and so a
theoretical description of $\delta_p$ would take that quantity to be no smaller than
$p^{-\eta}$, for a small constant $\eta > 0$. In particular, assuming that $\hat\Psi_{Wj}$ was
root-p consistent for $\Psi_{Wj}$, i.e. taking $\eta$ as large as $\frac12$, would be out of the
question.

In other problems the coordinate data $X_{ij}$ and $Y_{ij}$ can plausibly be taken
as approximately normally distributed, since they are based on Student’s
t statistics. See sections 6.2.2 and 6.3.3 for discussion. In such cases the
size of $\delta_p$ is determined by the number of data in samples from which the t
statistic is computed. This would also be much less than p, and so again a
mathematical account of the size of $\delta_p$ would have it no smaller than $p^{-\eta}$,
for $\eta > 0$ much less than $\frac12$.

Against this background, and taking, for simplicity, $\Psi = \Psi_{Wj}$, $\alpha = \alpha_{Wj}$
and $\delta_p = p^{-\eta}$, we find that $\hat\psi_W \sim \psi_W = p\,\Psi(1-\Psi)$ and
$\sum_j (\hat\Psi_{Wj} - \Psi_{Wj}) = p^{1-\eta}\alpha + o(p^{1-\eta})$. These results, and (6.19), lead to
the conclusion that, for fixed u, the argument of the infimum in the definition
of $\widehat{hc}_W$, at (6.4), is

$$\begin{aligned}
A(u) &= \hat\psi_W(u)^{-1/2} \sum_j \{I(|Z_j - \bar W_{.j}| \le u) - \hat\Psi_{Wj}(u)\} \\
     &= \{1 + o(1)\}\,\psi_W(u)^{-1/2} \sum_j \{I(|Z_j - \bar W_{.j}| \le u) - \Psi_{Wj}(u)\}
        - p^{(1/2)-\eta}\,\gamma(u) + o(p^{(1/2)-\eta}),
\end{aligned} \tag{6.20}$$

where $\gamma = \alpha\,\{\Psi(1-\Psi)\}^{-1/2}$.

Assume, again for simplicity, that Z is drawn from $\Pi_W$. Then, for fixed
u, the series on the right-hand side of (6.20) has zero mean, and equals
$O_p(p^{1/2})$. In consequence,

$$A(u) = O_p(1) - p^{(1/2)-\eta}\,\gamma(u) + o_p(p^{(1/2)-\eta}). \tag{6.21}$$

Referring to the definition of $\widehat{hc}_W$ at (6.4), we conclude from (6.21) that
for fixed u,

$$\widehat{hc}_W \le O_p(1) - p^{(1/2)-\eta}\,\gamma(u) + o_p(p^{(1/2)-\eta}). \tag{6.22}$$

If u is chosen so that $\gamma(u) > 0$ then, since $\eta < \frac12$, the subtracted term on
the right-hand side of (6.22) diverges to $-\infty$ at a polynomial rate, and this
behaviour is readily mistaken for detection of a value of Z that does not
come from $\Pi_W$. (There, the rate of divergence can be particularly
small; see section 6.3.3 and [8].) This difficulty has arisen through the
accumulation of errors in the distribution approximation.
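
The growth of the subtracted term $p^{(1/2)-\eta}\gamma(u)$ in (6.22) can be tabulated numerically. The following is a minimal sketch, not part of the original argument. It assumes, purely for illustration, that $\Psi$ is the distribution function of $|N(0,1)|$ and that $\alpha(u)$ is a constant; the names `folded_normal_cdf` and `drift_term` are hypothetical.

```python
import math


def folded_normal_cdf(u):
    """Illustrative choice of Psi: P(|N(0,1)| <= u) = 2*Phi(u) - 1 = erf(u / sqrt(2))."""
    return math.erf(u / math.sqrt(2.0))


def drift_term(p, eta, u, alpha=1.0):
    """Subtracted term p^(1/2 - eta) * gamma(u), with gamma = alpha * {Psi (1 - Psi)}^(-1/2).

    alpha is treated as a constant here, which is an assumption of this sketch.
    """
    psi = folded_normal_cdf(u)
    gamma = alpha / math.sqrt(psi * (1.0 - psi))
    return p ** (0.5 - eta) * gamma


# With eta < 1/2 the term grows polynomially in p; with eta = 1/2 it stays bounded.
for p in (10**3, 10**4, 10**5):
    print(p, round(drift_term(p, eta=0.25, u=1.0), 3), round(drift_term(p, eta=0.5, u=1.0), 3))
```

Under these assumptions the term is multiplied by 10 each time p is multiplied by $10^4$ when $\eta = 1/4$, whereas it does not change with p when $\eta = 1/2$.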


6.5. Numerical Properties in the Case of Classification 


We applied the higher-criticism classifier to simulated data. In each case, 
we generated $n_W = 10$ vectors of dimension p = 10°, from $\Pi_W = \Pi_X$
or $\Pi_Y$; and one observation Z from $\Pi_Y$. We generated the data such
that, for $i = 1, \ldots, n_W$ and $j = 1, \ldots, p$; and with W denoting X or Y;
Wi5 = (OW, -— 0%, Wy (52, + 52.5)/Nu + uw where, for k = 1 and 2, 


OM was the empirical mean and S2, ,, was the empirical variance of: (1) in 


the case of independence, Ny = 20 independent and identically distributed 
random variables, having the distribution function of a N(0,1) variable, a 
student Tio or a y2 random variable; and (2) in the case of dependence, 
Ny = 20 random variables of ie type VE us where, fori = 1,...,nw and 
§ = 1, 21.9%, Vi = = hy 6 Bh. Won, With 6 = 0.8 and eW , ~ N(0,(1+ 
6?)~+) denoting independent variables. 
We set $\mu_{X,j} = 5(j-1)/(p-1)$ and, in compliance with (C2), (6.10),
(6.11), took $\mu_{X,j} = \mu_{Y,j}$ for all but $q = \langle p^{1-\beta} \rangle$ randomly selected
components, for which $\mu_{Y,j} = \mu_{X,j} + \sqrt{2w \log p}$, where $\langle\cdot\rangle$ denotes the
integer-part function; and we considered different values of $\beta \in (\frac12, 1)$ and
$w \in (0, 1)$. Reflecting the results in sections 6.3.1 and 6.3.3, we estimated the
unknown distribution function of the observed data as the standard normal
distribution function. In all cases considered, we generated 500 samples in the
manner described above, and we repeated the classification procedure 500
times. Below we discuss the percentages of those samples which led to
correct classification.
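
The construction of the two mean vectors can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name `make_means` and the use of Python's `random.Random` for selecting the shifted components are assumptions.

```python
import math
import random


def make_means(p, beta, w, rng):
    """Return (mu_x, mu_y, shifted) for the sparse mean-shift design.

    mu_x[j-1] = 5 (j - 1) / (p - 1) for j = 1, ..., p;
    mu_y equals mu_x except on q = integer part of p^(1 - beta) randomly
    chosen components, where sqrt(2 w log p) is added.
    """
    mu_x = [5.0 * j / (p - 1) for j in range(p)]
    q = int(p ** (1.0 - beta))              # integer part of p^(1 - beta)
    shifted = rng.sample(range(p), q)       # q distinct component indices
    shift = math.sqrt(2.0 * w * math.log(p))
    mu_y = list(mu_x)
    for j in shifted:
        mu_y[j] += shift
    return mu_x, mu_y, shifted


rng = random.Random(0)
mu_x, mu_y, shifted = make_means(p=1000, beta=0.6, w=0.3, rng=rng)
print(len(shifted), round(math.sqrt(2.0 * 0.3 * math.log(1000)), 3))
```

Larger values of $\beta$ give fewer shifted components, and larger values of w give a larger shift on each of them.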

Application of the method necessitated selection of the two parameters
t and C defining $\mathcal{U}_W$. In view of condition (A2), we reformulated $\mathcal{U}_W$ as
$\mathcal{U}_W = [t_1, t_2]$, and we replaced choice of t and C by choice of $t_1$ and $t_2$. If we
have sufficient experience with the distributions of the data, $t_1$ and $t_2$ can be
selected ‘theoretically’ to maximise the percentage of correct classifications.

In the tables below we compare the results obtained using three methods:
the higher-criticism procedure for the optimal choice of $(t_1, t_2)$, referring
to it as $\mathrm{HC}_T$; higher criticism without thresholding, i.e. for $(t_1, t_2) =
(-\infty, \infty)$, to which we refer as simply HC; and the thresholded
nearest-neighbour method, $\mathrm{NN}_T$ (see e.g. [15]), i.e. the nearest-neighbour method
applied to thresholded data $W_j I\{W_j > t\}$, where W denotes X, Y or Z
and the threshold, t, is selected in a theoretically optimal way using the
approach described above for choosing $(t_1, t_2)$.
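
A thresholded higher-criticism statistic of the kind compared here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the infimum over $[t_1, t_2]$ is approximated on a finite grid, every $\Psi_{Wj}$ is replaced by the distribution function of $|N(0, \mathrm{scale}^2)|$ with `scale = 1` by default, and the decision rule (assign Z to the population with the larger statistic) is assumed rather than quoted. The names `hc_statistic` and `classify` are hypothetical.

```python
import bisect
import math
from statistics import NormalDist


def hc_statistic(z, w_bar, t1, t2, n_grid=200, scale=1.0):
    """Grid approximation to inf over u in [t1, t2] of
    psi(u)^(-1/2) * sum_j { I(|z_j - w_bar_j| <= u) - Psi(u) },
    where Psi(u) = P(|N(0, scale^2)| <= u) and psi = p * Psi * (1 - Psi).
    """
    nd = NormalDist(0.0, scale)
    diffs = sorted(abs(a - b) for a, b in zip(z, w_bar))
    p = len(diffs)
    best = math.inf
    for i in range(n_grid):
        u = t1 + (t2 - t1) * i / (n_grid - 1)
        big_psi = 2.0 * nd.cdf(u) - 1.0
        small_psi = p * big_psi * (1.0 - big_psi)
        if small_psi <= 0.0:
            continue                                  # skip degenerate grid points
        count = bisect.bisect_right(diffs, u)         # number of |z_j - w_bar_j| <= u
        best = min(best, (count - p * big_psi) / math.sqrt(small_psi))
    return best


def classify(z, x_bar, y_bar, t1, t2):
    """Assumed rule: assign z to X when its statistic against X is at least that against Y."""
    return "X" if hc_statistic(z, x_bar, t1, t2) >= hc_statistic(z, y_bar, t1, t2) else "Y"


# Deterministic toy example: z consists of standard normal quantiles, so it matches
# x_bar = 0; y_bar differs from x_bar by 6 on the first 40 of 200 components.
p = 200
z = [NormalDist().inv_cdf((j + 0.5) / p) for j in range(p)]
x_bar = [0.0] * p
y_bar = [6.0] * 40 + [0.0] * (p - 40)
print(classify(z, x_bar, y_bar, t1=1.0, t2=3.0))
```

Taking $(t_1, t_2)$ very wide in this sketch corresponds to the unthresholded procedure HC, while a short interval in the upper tail corresponds to $\mathrm{HC}_T$.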

It is known ([15]) that, for normal variables, when the distribution of
the observations is known, classification using $\mathrm{HC}_T$ is possible if w and
$\beta$ are above the boundary determined by (6.11), but classification using
$\mathrm{NN}_T$ is possible only above the more restricted boundary determined by
$w = 2\beta - 1$. Below, we show that these results hold in our context too, where
the distribution of the data is known only approximately (more precisely,
estimated by the distribution of a standard normal variable). We shall
consider values of $(\beta, w)$ that lie above, on or below the boundary $w = 2\beta - 1$,
including values which lie between this boundary and that for higher
criticism. Tables 1 and 2 summarise results for the independent case (1), when
the observations were averages of, respectively, Student $T_{10}$ variables and
$\chi^2$ variables. In all cases, including those where classification was possible
for both methods, we see that the thresholded higher-criticism method
performs significantly better than the thresholded nearest-neighbour approach.
The results also show very clearly the improvement obtainable using the
thresholded version of higher criticism.
Table 1: Percentage of correct classifications in case (1) with $T_{10}$ variables,
using the optimal values of t, $t_1$ and $t_2$.


w = 0.2 w = 0.3 w= 0.4 w = 0.5 
FT| er HC er RE er WE er 
0.5 | 99.8 100 95.8 
0.6 | 77.4 86.4 83.0] 86.2 97.4 89.6]94.0 100 94.0 
0.65] 63.8 72.8 70.6|74.4 85.4 73.4|77.2 97.8 82.6 
0.7 63.2. 70.6 62.2|67.6 86.0 64.6]69.8 96.8 75.8 


0.75 59.0 72.8 58.2]}65.2 86.2 66.6 


Table 2: Percentage of correct classifications in case (1) with $\chi^2$ variables,
using the optimal values of t, $t_1$ and $t_2$.


w = 0.2 w = 0.3 w= 0.4 w = 0.5 
[Ss Wr HE] Wor WE [RN Wor HE [RN HG 
99.8 100 97.6 
75.8 85.4 78.6]86.2 98.4 89.0 
0.65] 66.2 70.2 68.0/69.8 86.2 72.4 
64.4 74.8 64.2 : : 4 |75.0 97.0 69.8 
64.2 85.4 59.4 


Table 3: Percentage of correct classifications in case (1) with normal
variables (line 1), case (2) with L = 1 (line 2) or L = 8 (line 3), using the
optimal values of t, $t_1$ and $t_2$.
In Table 3 we compare the results of the independent case (1), where the
data were Studentised means of independent N(0,1) variables and so had 
Student’s t distribution; and the dependent case (2), where the observations 
were Studentised means of correlated normal variables with either L = 1 
or L = 3. Here we see that as the strength of correlation increases, the 
nearest-neighbour method seems to deteriorate more rapidly than higher 
criticism, which, as indicated in section 6.3.2, remains relatively unaffected 
by lack of independence. 

If previous experience with data of the type being analysed is not suf- 
ficient to permit effective choice of threshold using that background, then 
a data-driven selection needs to be developed. We implemented a cross- 
validation procedure, described in Appendix A.1. 


6.6. Technical Arguments 
6.6.1. Proof of Theorem 6.1 


Since $\psi_W(u) \ge C$ for each $u \in \mathcal{U}_W(C,t)$ then (A1) implies that
$\psi_W(u)^{-1/2} \sum_j |\hat\Psi_{Wj}(u) - \Psi_{Wj}(u)| = o_p(1)$ uniformly in $u \in \mathcal{U}_W(C,t)$, and
hence that $\hat\psi_W(u)/\psi_W(u) = 1 + o_p(1)$, uniformly in $u \in \mathcal{U}_W(C,t)$. Call
this result $R_1$. That property and (A2) imply that with probability
converging to 1 as $p \to \infty$, $\mathcal{U}_W(C_3,t) \subseteq \hat{\mathcal{U}}_W(C_2,t) \subseteq \mathcal{U}_W(C_1,t)$; call this
result $R_2$. (Since $\psi_W(t) \ge C_3$ then $t \in \mathcal{U}_W(C_3,t)$, and so the latter set is
nonempty.) Results $R_1$, $R_2$ and (A1) together give (6.6). Property $R_2$ and
(6.6) imply (6.7).


6.6.2. Proof of Theorem 6.2 


Let Ww be as in (B1). Since, in the case where Z is drawn from Ilw, Viw;, 
for 1 < j <p, have the joint distribution of Z; — W.,;, for 1 < 7 < p, then 
for Z from either IIx or Ily we may write Z; — W,; = Vw; +vwz;, where 
Vwzj = Uzj — bw ;. Substituting this representation for Z; — W. ; into the 
definition of hcw at (6.2), and defining Aw; = Vw; — Vw we see that 


Pp 
hey = ae ww (u) 1/2 > {l(|\Viv, + Aw; + vw zj\ < u) - Vw;(u)} . 
I= 
(6.23) 
Given D > 0 and v = 0 or +1, define 


Pp 
hewz(v) = inf pw(u)? S° {1(\Viy; + ewasl < w+ up~”) — Uw; (u)}, 
ucuUw f=] 
hewyz = inf dw(u M2 {71 Vins + uw2j| <u) — P(\V;| <u}. 
j=l 


Let Ew denote the event that |Aw,;| < p-P for each 1 < j < p. In view 
of (B1)(a), 

for all C3>0, P(Ew)=1-O(p%). (6.24) 
Now, with probability 1, 


hewz(-1) < hewz(0) <hewz(1) and 
hewz(—-1) < hew < heyz(1) if Ew holds, (6.25) 


where we used (6.23) to obtain the second set of inequalities. Furthermore, 


< sup Vw LH (u —p P< |Win; tuwzj| <ut+p-”) 


ucuw 


< sup /obw OL Mea <|Vwi tyuwz;|<u+2p-”) (6.26) 


ucuyw 


where the first inequality holds with probability 1 and the second holds 
almost surely on Ew. 

Let 1 < 7; <p, and take Cy > 0. Using (B1) and (B2) it can be shown 
that the probability that there are no integers jg ~ 7, with 1 < jg < p and 


wa, tvzi.|— wis tvzjol| < Cap? , (6.27) 


is bounded below by 1— Cs p!~?™ uniformly in j;, where Cs > 0 and C, is 
the constant in (B2)(a). Adding over 1 < j; <p, and choosing D > 2C;’, 
we deduce that: 


The probability that there is no pair (j1, 72) of distinct indices such that 


—D 


Vij, + ¥az,| and |Vw;, + vaj,| are closer than C4 p~”, converges to zero 


as Pp — ov. (6.28) 


If, in the case Cy = 4, the inequality (6.27) fails for all distinct integer 
pairs (j1,j2) with 1 < j1,j2 < p, then the series on the far right-hand 
side of (6.26) can have no more than one nonzero term. That term, if it 
exists, must equal 1. In this case the far right-hand side of (6.26) cannot 




exceed SUP, cry, wbw(u)-V/ ? which in turn is bounded above by a constant, 


Ce = C~'/?. Hence, (6.26) and (6.28) imply that 
P{0 < hewz(1) — hewz(-1) < Ce} 1. (6.29) 
Combining (6.24), (6.25) and (6.29) we deduce that 
P{|hew — heyz(0)| < Ce} > 1. (6.30) 
Observe too that, uniformly in u, 
|P(Vivg] <u) — Yw5(u)| < |Ywi (utp?) — Ywi(u—p-”)| + P(Ew) 
< Cr (4p-?) + P(Ew) = O(p-?™), 
where we have used (B2) and (6.24). Hence, 
cfr —elr2(0)1 $ sup vw(¥? S> [PUvirs| <u) — wal) 


j=l 


= O(p ?“). 


Combining this result and (6.30) we deduce that if Cg > 0 is chosen suffi- 
ciently large, 


P(|hew — heywz| < Cs) > 1. (6.31) 
Next we introduce further notation, defining Vywzj;(u) = P(|Vw; + 
Vw z3| < u), 
wy i (u) = P(\Vwsl Su), Wivz;(u) = P(Vw; + wail Su), 


Pp Pp 
dash __ dash dash 
Yvwz = Zs Uwz;1-Ywz), vwz= 5 Wirz; (1 — Wwy)) 
j=l 


Pp 
éwz = >— (Uw; - Ywz;), wwz = vw t+ owz, 


Pp 
WM?) N° {1(Vivj + vail < u) — VPP, (u)}], 


j=l 


he(?), = sup wwz(u) 
w 


Pp 
nel), = sup dw(u)/? S* {P(Vw5| < u) — PV; + vwz,l < w)} 
U — 


= sup vw(u 8 Do tte — wiPP,(u)} 
The remainder of the proof develops approximations to he\? le and he. 
Using (B1)(a) and (B2)(a) it can be shown that, aioe in u, 


lowz _ ees | = O(p'-?%) = 0, (6.32) 


provided D > Cy'. Also, if D > C7’ then a similar argument can be used 
to show that, with hcyz defined as at (6.8), 


Ince, —hewz| 0. (6.33) 
By (B2)(b), 0 < Vwz; < Yw,; <1, from which it follows that 


Vw, (1 — Uws) + Uws — Uw; 
= Wwz; (1 — Uwz;) + (Ww; -— Ywz;) (2 — Uw 5 — Ywaz;) 
> Wwaz; (1 — Ywz;) 


for each j7. Adding over j we deduce that wwz > Wwz. Combining this 
result with (6.32), and noting that wwz(u) > vw(u) > C for ue Uw = 
Uw (C,t), we deduce that, for a constant Co > 0, 


for allucUw, wi?S(u) < Cowwz(u). (6.34) 


Write (-) for the integer-part function. Given € € (0,1), use (B1)(b) to 
break the sum inside the absolute value in the definition of he), taken 
over 1 <j <p, into (p‘) series, each consisting only of independent terms. 
Let Swzx(u), for 1 < k < (p*), denote the kth of these series. Now, 
E{Swzr(u)} = 0 and, for u € Up, 


var{ Sw zr (u )} < dash (a ) ae Co wywz(u), (6.35) 


where the variance is computed using the expression for Sw zp(u) as a sum 
of independent random variables, and the second inequality comes from 
(6.34). 

Employing (6.35), and noting again the independence property, stan- 
dard arguments can be used to show that for each choice of Cio, Cy, > 0, 


max Pf sup wwz(u)~*/? |Swzpr(u)| > pow} = Or): (6.36) 
1<k<(p*) ucuw 


In particular, using Rosenthal’s inequality, Markov’s inequality and the 
fact that wwz(u) > ww(u) > C for u € Uw, we may show that for all 
Bis Bo o 0, 


P{\s > ape =O 
a ac al 
Therefore, if Vw = Vw(p) denotes any subset of Uw that contains only 
O(p®?*) elements, for some B3 > 0, then for all By, By > 0, 


max Pit maa wwz(u)/? |Swze(u)| > po} = Opes 23) = Oy **), 


1<k<(p*) 

(6.37) 
where By = Bo — B3. Since Bs and By, both can be taken arbitrarily 
large, then, using the monotonicity of the function g(u) = I(v < wu), and 
also properties (B1) and (B2), it can be seen that max,cy,, in (6.37) can 
be replaced by sup, cy, giving (6.36). In this context, condition (B2)(c) 
ensures that, with an error that is less than p~?°, for any given Bs > 0, 
the distribution Vy; can be truncated at a point p?*, for sufficiently large 
Bg; and, within the interval [0,p®*], the points in Vy can be chosen less 
than p~?7 apart, where B7 > 0 is arbitrarily large. 

Result (6.36) implies that 


a max sup w(t)” 1 ISwaelu)| > pb = O(—O), 


1<k<(p*) ucuw 


from which it follows that P(he?, > pet©) = O(p*-2"). Since €, Cio 
and Cj, are arbitrary positive numbers then we may replace € here by zero, 
obtaining: for each Cio, Ci, > 0, 


P(held), > p@) = O(p-%) . (6.38) 


It can be deduced directly from the definitions of hey z, he), and 
he\), that: 


lhe, theW,|<hc®, sup swe) =he®), sup ,/1+ owz(u) 


Combining this result with (6.31), (6.33) and (6.38); and noting that 


owz(u) 
hecwz = sup ———-, 
oe io a vw (u)!/? 


and, since wyw(u) > C for u € Uw, 


1/2 1/2 
sup {1+ Suet) | < (140-4) sup {1 owele) | 


ucluy Yw (u) ucuy Vw (u)1/2 
we deduce that for each € > 0, 
hew + hewz = Op{p* (1+ hewz)'"\. (6.39) 


Theorem 6.2 follows directly from (6.39). 




A. Delaigle and P. Hall 


References 


[1] Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006).
    Adapting to unknown sparsity by controlling the false discovery rate.
    Ann. Statist. 34 584-653.

[2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery
    rate: a practical and powerful approach to multiple testing. J. Roy.
    Statist. Soc. Ser. B 57 289-300.

[3] Broberg, P. (2003). Statistical methods for ranking differentially
    expressed genes. Genome Biol. 4 R41 (electronic).

[4] Cai, T., Jin, J. and Low, M. (2007). Estimation and confidence sets for
    sparse normal mixtures. Ann. Statist., to appear.

[5] Cayón, L., Banday, A. J., Jaffe, T., Eriksen, H. K. K., Hansen, F. K.,
    Gorski, K. M. and Jin, J. (2006). No higher criticism of the Bianchi
    corrected WMAP data. Mon. Not. Roy. Astron. Soc. 369 598-602.

[6] Cayón, L., Jin, J. and Treaster, A. (2005). Higher criticism statistic:
    Detecting and identifying non-Gaussianity in the WMAP first year data.
    Mon. Not. Roy. Astron. Soc. 362 826-832.

[7] Cruz, M., Cayón, L., Martinez-Gonzalez, E., Vielva, P. and Jin, J. (2007).
    The non-Gaussian cold spot in the 3-year WMAP data. Astrophys. J. 655
    11-20.

[8] Donoho, D. L. and Jin, J. (2004). Higher criticism for detecting sparse
    heterogeneous mixtures. Ann. Statist. 32 962-994.

[9] Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice
    of a null hypothesis. J. Amer. Statist. Assoc. 99 96-104.

[10] Fan, J., Chen, Y., Chan, H. M., Tam, P. and Ren, Y. (2005). Removing
    intensity effects and identifying significant genes for Affymetrix arrays
    in MIF-suppressed neuroblastoma cells. Proc. Nat. Acad. Sci. USA 102
    17751-17756.

[11] Fan, J. and Fan, Y. (2007). High dimensional classification using
    features annealed independence rules. Ann. Statist., to appear.

[12] Fan, J., Peng, H. and Huang, T. (2005). Semilinear high-dimensional model
    for normalization of microarray data: a theoretical analysis and partial
    consistency. (With discussion.) J. Amer. Statist. Assoc. 100 781-813.

[13] Fan, J., Tam, P., Vande Woude, G. and Ren, Y. (2004). Normalization and
    analysis of cDNA micro-arrays using within-array replications applied to
    neuroblastoma cell response to a cytokine. Proc. Nat. Acad. Sci. USA 101
    1135-1140.

[14] Hall, P. and Jin, J. (2006). Properties of higher criticism under
    long-range dependence. Ann. Statist., to appear.

[15] Hall, P., Pittelkow, Y. and Ghosh, M. (2008). On relative theoretical
    performance of classifiers suitable for high-dimensional data and small
    sample sizes. J. Roy. Statist. Soc. Ser. B, to appear.

[16] Huang, J., Wang, D. and Zhang, C. (2003). A two-way semi-linear model for
    normalization and significant analysis of cDNA microarray data.
    Manuscript.

[17] Ingster, Yu. I. (1999). Minimax detection of a signal for $l^n$-balls.
    Math. Methods Statist. 7 401-428.

[18] Ingster, Yu. I. (2001). Adaptive detection of a signal of growing
    dimension. I. Meeting on Mathematical Statistics. Math. Methods Statist.
    10 395-421.

[19] Ingster, Yu. I. (2002). Adaptive detection of a signal of growing
    dimension. II. Math. Methods Statist. 11 37-68.

[20] Ingster, Yu. I. and Suslina, I. A. (2003). Nonparametric Goodness-of-Fit
    Testing Under Gaussian Models. Springer, New York.

[21] Jin, J. (2006). Higher criticism statistic: Theory and applications in
    non-Gaussian detection. In Statistical Problems in Particle Physics,
    Astrophysics And Cosmology. (Eds. L. Lyons and M. K. Unel.) Imperial
    College Press, London.

[22] Jin, J., Starck, J.-L., Donoho, D. L., Aghanim, N. and Forni, O. (2004).
    Cosmological non-Gaussian signature detection: Comparing performance of
    different statistical tests. Eurasip J. Appl. Signal Processing 15
    2470-2485.

[23] Lehmann, E. L., Romano, J. P. and Shaffer, J. P. (2005). On optimality of
    stepdown and stepup multiple test procedures. Ann. Statist. 33 1084-1108.

[24] Lonnstedt, I. and Speed, T. (2002). Replicated microarray data. Statist.
    Sinica 12 31-46.

[25] Meinshausen, N. and Rice, J. (2006). Estimating the proportion of false
    null hypotheses among a large number of independent tested hypotheses.
    Ann. Statist. 34 373-393.

[26] Scheffé, H. (1959). The Analysis of Variance. Wiley, New York.

[27] Storey, J. D. and Tibshirani, R. (2003). Statistical significance for
    genome-wide experiments. Proc. Nat. Acad. Sci. USA 100 9440-9445.

[28] Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control,
    conservative point estimation, and simultaneous conservative consistency
    of false discovery rates: A unified approach. J. Roy. Statist. Soc. Ser. B
    66 187-205.

[29] Tseng, G. C., Oh, M. K., Rohlin, L., Liao, J. C. and Wong, W. H. (2001).
    Issues in cDNA microarray analysis: Quality filtering, channel
    normalization, models of variations and assessment of gene effects.
    Nucleic Acids Res. 29 2549-2557.

[30] Tukey, J. W. (1953). The problem of multiple comparisons. Manuscript.
    Department of Statistics, Princeton University.

[31] Wang, Q. (2005). Limit theorems for self-normalized large deviation.
    Electronic J. Probab. 38 1260-1285.

[32] Wang, Q. and Hall, P. (2007). Relative errors in central limit theorem
    for Student’s t statistic, with applications. Manuscript.

Appendix 


A.1. Description of the Cross-Validation Procedure 


If previous experience with data of the type being analysed is not sufficient
to permit effective choice of threshold using that background, then a
data-driven selection needs to be developed. This, however, is a challenging task,
as the sample sizes are typically very small. As a first practical method, we
implemented a cross-validation (CV) procedure where the basic idea was as
follows. Create $n_X + n_Y$ cross-validation samples $(X_{CV,k}, Y_{CV,k}, Z_{CV,k}) =
(W^{(-j)}, T, W_j)$, $k = j + n_X I(W = Y)$, $j = 1, \ldots, n_W$, $(W,T) = (X,Y)$ or
$(Y,X)$, where $W^{(-j)}$ denotes the sample W with the jth observation $W_j$
left out; apply the classification procedure to each CV sample, and then
choose $(t_1, t_2)$ to give a large number of correct classifications, but not too
large so as to avoid ‘overfitting’ the data. We experimented with different
ways of avoiding the overfitting problem, and found that the following gave
quite good results.


(a) Here we describe how to choose the grid on which we search for $(t_1, t_2)$.
    One of the problems in our context is that p is so large that removing
    one of the data values, as is done in cross-validation, has substantial
    impact on the range of the observed data. Therefore, and since we expect
    $t_2$ to be related to the extreme observed values, it would not be
    appropriate to choose a grid for $(t_1, t_2)$ and keep it fixed over each
    iteration of the algorithm. Instead, at each step k, where
    $k = 1, \ldots, n_X + n_Y$, of the algorithm we define the grid in terms of
    a set of $K \in [2, 2p-1]$ order statistics $U_{(i_1)} \le U_{(i_2)} \le \ldots \le U_{(i_K)}$
    of the vector $U = (|Z_{CV,k} - \bar X_{CV,k}|, |Z_{CV,k} - \bar Y_{CV,k}|)$. (To lighten
    the notation, we omit the index k from U.) We keep fixed the vector
    $I = (i_1, \ldots, i_K)$ of K indices. At each step we define our grid
    for $(t_1, t_2)$ as $U_{(I)} \times U_{(I)}$, where $U_{(I)}$ denotes $(U_{(i_1)}, \ldots, U_{(i_K)})$. The
    indices $1 \le i_1 < i_2 < \ldots < i_K \le 2p$ are chosen such that the last,
    say, $K - S$ order statistics $V_{(i_{S+1})} \le V_{(i_{S+2})} \le \ldots \le V_{(i_K)}$ of the vector
    $V = (|Z - \bar X|, |Z - \bar Y|)$ consist of the extreme values of V, and
    the first S order statistics $V_{(i_1)} \le V_{(i_2)} \le \ldots \le V_{(i_S)}$ are uniformly
    distributed over the interval $[V_{(1)}, V_{(i_{S+1}-1)}]$.

(b) For $k = 1, \ldots, n_X + n_Y$, apply the HC procedure to the kth
    cross-validation sample, for each $(t_1, t_2)$ in the grid $U_{(I)} \times U_{(I)}$.

(c) For each $1 \le j, k \le K$, let $C_{j,k}$ denote the number of correct
    classifications out of the $n_X + n_Y$ cross-validation trials at (b), obtained by
    taking $(t_1, t_2) = (U_{(i_j)}, U_{(i_k)})$. Of course, since $t_1$ must be less than $t_2$,
    we set $C_{j,k} = 0$ for all $j > k$.

(d) Taking V as in (a), construct the vector $\mathbf{t}_2$ of all values $V_{(i_k)}$ for
    which $\sup_j C_{j,k} \ge M' = \sup_{j,k} C_{j,k} - (n_X + n_Y)/10$. The factor
    $(n_X + n_Y)/10$ was chosen heuristically and it is introduced to avoid
    overfitting the data. Take $t_2$ as the component of $\mathbf{t}_2$, say $V_{(i_\ell)}$, for
    which $\#\{j \text{ s.t. } C_{j,\ell} > M'\}$ is the largest; in case of non-uniqueness,
    take $V_{(i_\ell)}$ as the largest such component. Then take $t_1$ as the average
    of all $V_{(i_j)}$’s such that $C_{j,\ell} > M'$.


In most cases this method gave good results, with performance lying
approximately midway between that using the theoretically optimal $(t_1, t_2)$
or no thresholding, i.e. $(t_1, t_2) = (-\infty, \infty)$, respectively.
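
Steps (c) and (d) reduce to a small computation on the matrix of cross-validation counts. The sketch below is one reading of that rule, not the authors' code: `C[j][k]` is assumed to hold the number of correct classifications at $(t_1, t_2)$ equal to the jth and kth grid values (with zeros for $j > k$), `grid` is assumed to be sorted in increasing order, and the function name `select_thresholds` is hypothetical.

```python
def select_thresholds(C, grid, n_total):
    """Choose (t1, t2) from a K x K matrix of CV counts.

    M' = max_{j,k} C[j][k] - n_total / 10, where n_total = n_X + n_Y.
    Candidate columns k are those with max_j C[j][k] >= M'.
    t2 is the candidate with the most rows satisfying C[j][k] > M'
    (ties broken by taking the largest grid value); t1 is the average of
    the grid values of those rows.
    """
    K = len(grid)
    m_prime = max(max(row) for row in C) - n_total / 10.0
    col_max = [max(C[j][k] for j in range(K)) for k in range(K)]
    candidates = [k for k in range(K) if col_max[k] >= m_prime]

    def n_good(k):
        return sum(1 for j in range(K) if C[j][k] > m_prime)

    most = max(n_good(k) for k in candidates)
    ell = max(k for k in candidates if n_good(k) == most)   # largest component on ties
    rows = [j for j in range(K) if C[j][ell] > m_prime]
    t2 = grid[ell]
    t1 = sum(grid[j] for j in rows) / len(rows)
    return t1, t2


# Toy example with K = 3 grid values and n_X + n_Y = 20 trials.
C = [[5, 12, 18],
     [0, 15, 19],
     [0, 0, 17]]
print(select_thresholds(C, grid=[1.0, 2.0, 3.0], n_total=20))
```

In the toy example $M' = 19 - 2 = 17$, only the last column reaches $M'$, and two of its rows exceed $M'$, so $t_2$ is the largest grid value and $t_1$ is the mean of the first two.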


A.2. Proof of Theorem 6.3 


For simplicity, we denote $N_{Wj}$ by N. Recall that $\chi_{Wj}$ denotes the
distribution of $V_{Wj}$, i.e. the distribution of $Z_j - \bar W_{.j}$ when Z is drawn from $\Pi_W$.
It can be proved from results of [31] that, under (C1), uniformly in values
of $u > 0$ that satisfy $u = o(N^{1/6})$, and uniformly in W and in $1 \le j \le p$,

$$\chi_{Wj}(u) = \Phi(u) + O[N^{-1/2}\,|u|^3\,\{1 - \Phi(u)\}],$$

where $\Phi$ denotes the standard normal distribution function. An analogous
result for $u < 0$ also holds. Hence for $u > 0$ satisfying $u = o(N^{1/6})$, we
have uniformly in W and in $1 \le j \le p$,

$$P(|V_{Wj}| \le u) = 2\Phi(u) - 1 + O[N^{-1/2}\,u^3\,\{1 - \Phi(u)\}]. \tag{6.40}$$
Similarly it can be shown that, uniformly in $j = j_1, \ldots, j_q$, the latter as
in (C1),

$$P(|V_{Wj} \pm \nu| \le u) = \Phi(u+\nu) + \Phi(u-\nu) - 1
  + O[N^{-1/2}\,(u+\nu)^3\,\{1 - \Phi(|u-\nu|)\}]. \tag{6.41}$$


Let $a_{WQ}(u)$ denote the series in the definition of $hc_{WQ}$, at (6.8).
Combining (6.40) and (6.41) we deduce that, if $Q \ne W$,

$$a_{WQ}(u) = q\,\{2\Phi(u) - \Phi(u+\nu) - \Phi(u-\nu)\}
  + O[N^{-1/2}\,q\,(u+\nu)^3\,\{1 - \Phi(|u-\nu|)\}], \tag{6.42}$$

$$\psi_W(u) = 2p\,\{2\Phi(u) - 1\}\{1 - \Phi(u)\} + O[N^{-1/2}\,p\,u^3\,\{1 - \Phi(u)\}]
  = \{1 + o(1)\}\,2p\,\{2\Phi(u) - 1\}\{1 - \Phi(u)\}, \tag{6.43}$$

uniformly in $u \in \mathcal{U}_W(C,t)$. To obtain the second identity in (6.43) we used
the properties $t \ge B > 0$ and $N^{-1/2}(\log p)^{3/2} \to 0$, from (C2) and (6.12)
respectively.
Take $u = \sqrt{2v\log p}$ where $0 < v = v(p) \le 1$, and recall that $\nu =
\sqrt{2w\log p}$, where w and s are as in (C2). It can be shown, borrowing
ideas from [8], that

$$2\Phi(u) - \Phi(u+\nu) - \Phi(u-\nu) = g_1(p)\,p^{-(\sqrt{v} - \sqrt{w})^2}, \tag{6.44}$$

$$\{2\Phi(u) - 1\}\{1 - \Phi(u)\} \sim C_1\,(\log p)^{-1/2}\,p^{-v}, \tag{6.45}$$

where, here and below, $g_i$ denotes a function that is bounded above by $C_2$
and below by $C_3(\log p)^{-1/2}$, and $C_1$, $C_2$, $C_3$ denote positive constants. To
derive (6.44), write $2\Phi(u) - \Phi(u+\nu) - \Phi(u-\nu)$ as

$$\{1 - \Phi(u+\nu)\} + \{1 - \Phi(u-\nu)\} - 2\{1 - \Phi(u)\},$$

and use conventional approximations to $1 - \Phi(z)$, for moderate to large
positive z, and, when $u - \nu < 0$, to $\Phi(-z)$, for z in the same range.
In view of (6.44) and (6.45),

$$\frac{2\Phi(u) - \Phi(u+\nu) - \Phi(u-\nu)}{[p\,\{2\Phi(u) - 1\}\{1 - \Phi(u)\}]^{1/2}}
  = g_2(p)\,(\log p)^{1/4}\,p^{b_1}, \tag{6.46}$$

where $b_1 = \frac12(v - 1) - (\sqrt{v} - \sqrt{w})^2$. Similarly,
$$\frac{N^{-1/2}\,(u+\nu)^3\,\{1 - \Phi(|u-\nu|)\}}{[p\,\{1 - \Phi(u)\}]^{1/2}}
  = O\{N^{-1/2}\,(\log p)^{7/4}\,p^{b_1}\}. \tag{6.47}$$

Using (6.42), (6.43), (6.46) and (6.47) we deduce that, provided
$N^{-1}(\log p)^4 \to 0$,


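$$\frac{a_{WQ}(u)}{\psi_W(u)^{1/2}} \sim 2^{-1/2}\,q\,g_2(p)\,(\log p)^{1/4}\,p^{b_1}
  = g_3(p)\,(\log p)^{1/4}\,p^{b_2}, \tag{6.48}$$

where $b_2 = \frac12(v + 1) - \beta - (\sqrt{v} - \sqrt{w})^2$.
Since s, in the definition of $t = \sqrt{2s\log p}$, satisfies $0 < s < \min(4w, 1)$,
we can take

$$v = \begin{cases}
  4w & \text{if } 0 < w < \frac14, \\
  1 - c\,(\log p)^{-1}\log\log p & \text{if } \frac14 \le w < 1,
\end{cases}$$

where $c > \frac12$, and have
$u = \sqrt{2v\log p} = \min(2\nu, \sqrt{2\log p - 2c\log\log p}) \in \mathcal{U}_W(C,t)$.
For this choice of v, $b_2 = 2\eta$ where $\eta > 0$, and it follows from (6.48) that

$$hc_{WQ} \ge \frac{a_{WQ}(u)}{\psi_W(u)^{1/2}} \ge C_4\,p^{\eta}.$$

Result (6.13) follows.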
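
The exponent that drives the final bound can be evaluated directly. The sketch below takes the exponent to be $b_2 = \frac12(v+1) - \beta - (\sqrt{v} - \sqrt{w})^2$ and the choice of v to be $4w$ for $w < \frac14$ and $1 - c(\log p)^{-1}\log\log p$ otherwise; both are restated here as assumptions of the sketch, and the function names `b2` and `choose_v` are hypothetical.

```python
import math


def b2(v, w, beta):
    """Exponent b_2 = (v + 1)/2 - beta - (sqrt(v) - sqrt(w))^2 (assumed form)."""
    return 0.5 * (v + 1.0) - beta - (math.sqrt(v) - math.sqrt(w)) ** 2


def choose_v(w, p, c=1.0):
    """v = 4w when w < 1/4, otherwise 1 - c log(log p) / log p (c > 1/2 assumed)."""
    if w < 0.25:
        return 4.0 * w
    return 1.0 - c * math.log(math.log(p)) / math.log(p)


# For w < 1/4 and v = 4w the exponent simplifies to w + 1/2 - beta,
# so it is positive exactly when w > beta - 1/2.
for beta, w in ((0.6, 0.2), (0.6, 0.05)):
    print(beta, w, round(b2(choose_v(w, p=10**6), w, beta), 4))
```

With $v = 4w$ one has $(\sqrt{v} - \sqrt{w})^2 = w$, which is why the first case reduces to $w + \frac12 - \beta$.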
Motivated by problems in genomics, astronomy and some other
emerging fields, multiple hypothesis testing has come to the forefront
of statistical research in recent years. In the context of multiple
testing, new error measures such as the false discovery rate (FDR)
occupy important roles comparable to the role of type I error in classical
hypothesis testing. Assuming that a random mechanism decides the
truth of a hypothesis, substantial gain in power is possible by estimating
error measures from the data. Nonparametric Bayesian approaches have
proven to be particularly suitable for estimating error measures in
multiple testing situations. A Bayesian approach based on a nonparametric
mixture model for p-values can utilize special features of the distribution
of p-values, which significantly improves the quality of estimation. In this
paper we describe nonparametric Bayesian modeling of the
distribution of the p-values. We begin with a brief review of the Bayesian
nonparametric concepts of Dirichlet processes and Dirichlet mixtures, and of
classical multiple hypothesis testing. We then review recently proposed
nonparametric Bayesian methods for estimating errors based on a Dirichlet
mixture prior for the p-value density. When the test statistics are
independent, a mixture of beta kernels can adequately model the p-value
density, whereas in the dependent case one can consider a Dirichlet mixture
of multivariate skew-normal kernels as a prior for probit transforms of
the p-values. We conclude the paper by illustrating the scope of these
methods in some real-life applications.


7.1. Bayesian Nonparametric Inference 


To make inference given an observed set of data, one needs to model how the 
data are generated. The limited knowledge about the mechanism often does 
not permit explicit description of the distribution given by a relatively few 
parameters. Instead, only very general assumptions leaving a large portion 
of the mechanism unspecified can be reasonably made. This nonparamet- 
ric approach thus avoids possible gross misspecification of the model, and 
understandably is becoming the preferred approach to inference, especially 
when many samples can be observed. Nonparametric models are actually 
not parameter free, but they contain infinite dimensional parameters, which 
can be best interpreted as functions. In common applications, the cumula- 
tive distribution function (c.d.f.), density function, nonparametric regres- 
sion function, spectral density of a time series, unknown link function in a 
generalized linear model, transition density of a Markov chain and so on can 
be the unknown function of interest. The classical approach to nonparametric
inference has flourished throughout the last century. Estimation of the c.d.f.
is commonly done by the empirical c.d.f., which has attractive asymptotic
properties. Estimation of density, regression function and similar objects in 
general needs smoothing through the use of a kernel or through a basis ex- 
pansion. Testing problems are generally approached through ranks, which 
typically form the maximal invariant class under the action of increasing 
transformations. 

Bayesian approach to inference offers a conceptually straightforward 
and operationally convenient method, since one needs only to compute 
the posterior distribution given the observations, on which the inference is 
based. In particular, standard errors and confidence sets are automatically 
obtained along with a point estimate. In addition, the Bayesian approach 
enjoys philosophical justification and often Bayesian estimation methods 
have attractive frequentist properties, especially in large samples. However,
the Bayesian approach to nonparametric inference is challenged by the issue
of constructing a prior distribution on function spaces. Philosophically,
specifying a genuine prior distribution on an infinite dimensional space
amounts to adding an infinite amount of prior information about all fine
details of the function of interest. This is somewhat contradictory to the
motivation of nonparametric modeling, where one likes to avoid specifying too
much about the unknown functions. This issue can be resolved by considering the
so called “automatic” or “default” prior distributions, where some tractable
automatic mechanism constructs most of the prior by spreading the
mass all over the parameter space, while only a handful of key parameters
may be chosen subjectively. Together with additional conditions, large
support of the prior helps the posterior distribution concentrate around the 
true value of the unknown function of interest. This property, known as 
posterior consistency, validates a Bayesian procedure from the frequentist 
view, in that it ensures that, with a sufficiently large amount of data, the
truth can be discovered accurately and the data eventually overrides any 
prior information. Therefore, a frequentist will be more likely to agree to 
the inference based on a default nonparametric prior. Lack of consistency 
is thus clearly undesirable since this means that the posterior distribution 
is not directed toward the truth. For a consistent posterior, the speed of 
convergence to the true value, called the rate of convergence, gives a more 
refined picture of the accuracy of a Bayesian procedure in estimating the 
unknown function of interest. 

For estimating an arbitrary probability measure (equivalently, a c.d.f.)
on the real line, with independent and identically distributed (i.i.d.)
observations from it, Ferguson ([19]) introduced the idea of a Dirichlet
process: a random probability distribution $P$ such that for any finite
measurable partition $\{B_1,\ldots,B_k\}$ of $\mathbb{R}$, the joint
distribution of $(P(B_1),\ldots,P(B_k))$ is a finite dimensional Dirichlet
distribution with parameters $(\alpha(B_1),\ldots,\alpha(B_k))$, where
$\alpha$ is a finite measure called the base measure of the Dirichlet
process $D_\alpha$. Since clearly $P(A) \sim \mathrm{Beta}(\alpha(A),\alpha(A^c))$,
we have $E(P(A)) = \alpha(A)/(\alpha(A)+\alpha(A^c)) = G(A)$, where
$G(A) = \alpha(A)/M$ is a probability measure called the center measure and
$M = \alpha(\mathbb{R})$ is called the precision parameter. This implies that
if $X|P \sim P$ and $P \sim D_\alpha$, then marginally $X \sim G$. Observe that
$\mathrm{var}(P(A)) = G(A)G(A^c)/(M+1)$, so that the prior is more tightly
concentrated around its mean when $M$ is larger. If $P$ is given the measure
$D_\alpha$, we shall write $P \sim DP(M,G)$. The following summarizes the most
important facts about the Dirichlet process:


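(i) If $\int |\psi|\,dG < \infty$, then $E(\int \psi\,dP) = \int \psi\,dG$.
(ii) If $X_1,\ldots,X_n | P \stackrel{iid}{\sim} P$ and $P \sim D_\alpha$, then
$P | X_1,\ldots,X_n \sim D_{\alpha + \sum_{i=1}^n \delta_{X_i}}$.
(iii) $E(P | X_1,\ldots,X_n) = \frac{M}{M+n} G + \frac{n}{M+n} \mathbb{P}_n$, a
convex combination of the prior mean and the empirical distribution $\mathbb{P}_n$.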


(iv) Dirichlet sample paths are a.s. discrete distributions.
(v) The topological support of $D_\alpha$ is $\{P^* : \mathrm{supp}(P^*) \subset \mathrm{supp}(G)\}$.
(vi) The marginal joint distribution of $(X_1,\ldots,X_n)$ from $P$, where
$P \sim D_\alpha$, can be described through the conditional laws

$$X_i \mid X_j,\ j \ne i \;\sim\; \begin{cases} \delta_{\phi_j} & \text{with probability } \frac{n_j}{M+n-1},\ j = 1,\ldots,k_{-i},\\ G & \text{with probability } \frac{M}{M+n-1}, \end{cases}$$

where $k_{-i}$ is the number of distinct observations in $X_j$, $j \ne i$, and
$\phi_1,\ldots,\phi_{k_{-i}}$ are those distinct values with multiplicities
$n_1,\ldots,n_{k_{-i}}$. Thus the number of distinct observations $K_n$ in
$X_1,\ldots,X_n$ is generally much smaller than $n$, with
$E(K_n) = M\sum_{i=1}^n (M+i-1)^{-1} \sim M\log(n/M)$, introducing sparsity.

(vii) Sethuraman’s ([51]) stick-breaking representation:
$P = \sum_{i=1}^\infty V_i\delta_{\theta_i}$, where $\theta_i \stackrel{iid}{\sim} G$,
$V_i = [\prod_{j=1}^{i-1}(1-Y_j)]Y_i$, $Y_i \stackrel{iid}{\sim} \mathrm{Beta}(1,M)$.
This allows us to approximately generate a Dirichlet process and is
indispensable in various complicated applications involving the Dirichlet
process, where posterior quantities can be simulated approximately with the
help of a truncation and Markov chain Monte-Carlo (MCMC) techniques.
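Facts (vi) and (vii) can be illustrated by a minimal Python sketch. It is only an illustration under stated assumptions: the function names `stick_breaking` and `expected_distinct`, the truncation rule (stop when the unbroken stick is shorter than `tol`) and the standard normal center measure in the usage lines are our choices, not part of the original development.

```python
import random

def stick_breaking(M, base_sampler, tol=1e-8, rng=None):
    """Truncated draw P ~ DP(M, G) as atoms theta_i with weights V_i.

    Assumption: the infinite sum is truncated once the part of the stick
    that has not been broken off is shorter than `tol`.
    """
    rng = rng or random.Random()
    atoms, weights = [], []
    remaining = 1.0                      # prod_{j<i} (1 - Y_j)
    while remaining > tol:
        y = rng.betavariate(1.0, M)      # Y_i ~ Beta(1, M)
        weights.append(remaining * y)    # V_i = Y_i * prod_{j<i} (1 - Y_j)
        atoms.append(base_sampler(rng))  # theta_i ~ G
        remaining *= 1.0 - y
    return atoms, weights, remaining

def expected_distinct(M, n):
    """E(K_n) = M * sum_{i=1}^n 1 / (M + i - 1), as in fact (vi)."""
    return M * sum(1.0 / (M + i - 1) for i in range(1, n + 1))

rng = random.Random(0)
# Hypothetical choice: G is the standard normal distribution, M = 2.
atoms, weights, leftover = stick_breaking(2.0, lambda r: r.gauss(0.0, 1.0), rng=rng)
```

The weights and the leftover stick add up to one, so discarding `leftover` gives an approximation to $P$ whose total variation error is at most `tol`.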


In view of (iii), clearly $G$ should be elicited as the prior guess about $P$,
while $M$ should be regarded as the strength of this belief. Actual
specification of these is quite difficult in practice, so we usually let $G$
contain additional hyperparameters $\xi$, and some flat prior is put on $\xi$,
leading to a mixture of Dirichlet processes ([1]).

A widely different scenario occurs when one mixes parametric families
nonparametrically. Assume that given a latent variable $\theta_i$, the
observation $X_i$ follows a parametric density $\psi(\cdot;\theta_i)$,
$i = 1,\ldots,n$, respectively, and the random effects
$\theta_i \stackrel{iid}{\sim} P$, $P \sim D_\alpha$ ([20], [33]). In this
case, the density of the observation can be written as
$f_P(x) = \int \psi(x;\theta)\,dP(\theta)$. The induced prior distribution on
$f_P$ through $P \sim DP(M,G)$ is called a Dirichlet process mixture (DPM).
Since $f_P(x)$ is a linear functional of $P$, the posterior mean and variance
of the density $f_P(x)$ can be expressed analytically. However, these
expressions contain an enormously large number of terms. On the other hand,
computable expressions can be obtained by MCMC methods by simulating the
latent variables $(\theta_1,\ldots,\theta_n)$ from their posterior
distribution by a scheme very similar to (vi); see [18]. More precisely,
given $\theta_j$, $j \ne i$, only $X_i$ affects the posterior distribution of
$\theta_i$. The observation $X_i$ weighs the selection probability of an old
$\theta_j$ by $\psi(X_i;\theta_j)$, and the fresh draw by
$M\int \psi(X_i;\theta)\,dG(\theta)$, and a fresh draw, whenever obtained, is
taken from the “baseline posterior” defined by
$dG_i(\theta) \propto \psi(X_i;\theta)\,dG(\theta)$. The procedure is known as
the generalized Polya urn scheme.
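To make the generalized Polya urn scheme concrete, the following sketch performs Gibbs sweeps over the latent variables in a conjugate special case. Everything specific here is an assumption made for illustration: the kernel is taken to be $N(\theta, 1)$, the center measure $G$ is $N(0, \tau^2)$ (so that the fresh-draw weight and the baseline posterior are available in closed form), and the names `normal_pdf` and `polya_urn_sweep` are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def polya_urn_sweep(x, theta, M, tau2, rng):
    """One Gibbs sweep for X_i | theta_i ~ N(theta_i, 1), theta_i | P ~ P,
    P ~ DP(M, G) with G = N(0, tau2) (an assumed conjugate setup)."""
    n = len(x)
    for i in range(n):
        others = [theta[j] for j in range(n) if j != i]
        # Weight of reusing an old theta_j: psi(X_i; theta_j).
        w = [normal_pdf(x[i], t, 1.0) for t in others]
        # Weight of a fresh draw: M * int psi(X_i; t) dG(t) = M * N(X_i; 0, 1 + tau2).
        w.append(M * normal_pdf(x[i], 0.0, 1.0 + tau2))
        pick = rng.choices(range(n), weights=w)[0]
        if pick < n - 1:
            theta[i] = others[pick]
        else:
            # Baseline posterior: N(tau2 * X_i / (1 + tau2), tau2 / (1 + tau2)).
            post_var = tau2 / (1.0 + tau2)
            theta[i] = rng.gauss(post_var * x[i], math.sqrt(post_var))
    return theta

rng = random.Random(1)
x = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]   # toy data with two visible groups
theta = list(x)                          # start each latent value at its observation
for _ in range(200):
    theta = polya_urn_sweep(x, theta, M=1.0, tau2=4.0, rng=rng)
n_clusters = len(set(theta))             # number of distinct latent values
```

With a non-conjugate kernel neither the integral nor the baseline posterior is available in closed form, which is why other sampling schemes are needed in that case.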

The kernel used in forming DPM can be chosen in different ways de- 
pending on the sample space under consideration. A location-scale kernel is 
appropriate for densities on the line with unrestricted shape. In Section 7.3, 
we shall use a special type of beta kernel for decreasing densities on the
unit interval to model the density of p-values in the multiple hypothesis
testing problem.

To address the issue of consistency, let $\Pi$ be a prior on the densities
and let $f_0$ stand for the true density. Then the posterior probability of a
set $B$ of densities given observations $X_1,\ldots,X_n$ can be expressed as

$$\Pi(f \in B \mid X_1,\ldots,X_n) = \frac{\int_B \prod_{i=1}^n \frac{f(X_i)}{f_0(X_i)}\,d\Pi(f)}{\int \prod_{i=1}^n \frac{f(X_i)}{f_0(X_i)}\,d\Pi(f)}. \qquad (7.1)$$

When $B$ is the complement of a neighborhood $U$ of $f_0$, consistency requires
showing that the expression above goes to 0 as $n \to \infty$ a.s. $[P_{f_0}]$.
This will be addressed by showing that the numerator in (7.1) converges to
zero exponentially fast, while the denominator multiplied by $e^{\beta n}$ goes
to infinity for all $\beta > 0$. The latter happens if
$\Pi(f : \int f_0\log(f_0/f) < \epsilon) > 0$ for all $\epsilon > 0$. The
assertion about the numerator in (7.1) holds if a uniformly exponentially
consistent test exists for testing the null hypothesis $f = f_0$ against the
alternative $f \in U^c$. In particular, the condition holds automatically if
$U$ is a weak neighborhood, which is the only neighborhood we need to consider
in our applications to multiple testing.


7.2. Multiple Hypothesis Testing 


Multiple testing procedures are primarily concerned with controlling the 
number of incorrect significant results obtained while simultaneously
testing a large number of hypotheses. In order to control such errors an
appropriate error rate must be defined. Traditionally, the family-wise error
rate (FWER) has been the error rate of choice until recently when the 
need was felt to define error rates that more accurately reflect the scientific 
goals of modern statistical applications in genomics, proteomics, functional 
magnetic resonance imaging (fMRI) and other biomedical problems. In 
order to define the FWER and other error rates we must first describe 
the different components of a typical multiple testing problem. Suppose
$H_{10},\ldots,H_{m0}$ are $m$ null hypotheses whose validity is being tested
simultaneously. Suppose $m_0$ of those hypotheses are true and, after making
decisions on each hypothesis, $R$ of the $m$ hypotheses are rejected. Also,
denote the $m$ ordered p-values obtained from testing the $m$ hypotheses as
$X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$. Table 7.1 describes the
components associated with this scenario.

Table 7.1. Number of hypotheses accepted and rejected and their true status.

Hypothesis | Accept   Reject   Total
True       | U        V        m_0
False      | Q - U    R - V    m - m_0
Total      | Q        R        m

The FWER is defined as the probability of making at least one false
discovery, i.e., FWER = $P(V \ge 1)$. The most common FWER controlling
procedure is the Bonferroni procedure, where each hypothesis is tested at
level $\alpha/m$ to meet an overall error rate of $\alpha$; see [35]. When $m$
is large, this measure is very conservative and may not yield any
“statistical discovery”, a term coined by [54] to describe a rejected
hypothesis. Subsequently, several generalizations of the Bonferroni procedure
were suggested where the procedures depend on individual p-values, such as
[52], [30], [31], [29] and [42]. In the context of global testing where one is
interested in the significance of a set of hypotheses as a whole, [52]
introduced a particular sequence of critical values, $\alpha_i = i\alpha/m$,
to compare with each ordered p-value. More recently, researchers proposed
generalizations of the FWER (such as the k-FWER) that are more suitable for
modern applications; see [32].
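As a small illustration of the two ideas just described, the sketch below applies the Bonferroni rule and a global test based on the critical values $\alpha_i = i\alpha/m$. The function names are hypothetical and the p-values are made up for the example.

```python
def bonferroni_reject(pvalues, alpha):
    """Reject hypothesis i when its p-value is at most alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def simes_global_test(pvalues, alpha):
    """Reject the global null if some ordered p-value X_(i) is at most i * alpha / m."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return any(p <= i * alpha / m for i, p in enumerate(ordered, start=1))

pvals = [0.02, 0.03, 0.04]                         # illustrative values only
individual = bonferroni_reject(pvals, 0.05)        # compares each with 0.05 / 3
global_reject = simes_global_test(pvals, 0.05)     # compares with 0.05/3, 0.10/3, 0.15/3
```

Here no individual hypothesis passes the Bonferroni threshold, yet the set as a whole is declared significant, which shows how conservative the level $\alpha/m$ can be.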

While the FWER gives a very conservative error rate, at the other
extreme of the spectrum of error rates is the per comparison error rate
(PCER), where the significance of any hypothesis is decided without any
regard to the significance of the rest of the hypotheses. This is equivalent
to testing each hypothesis at a fixed level $\alpha$ and looking at the
average error over the $m$ tests conducted, i.e., PCER = $E(V/m)$. While the
PCER is advocated by some ([53]), it is too liberal and may result in several
false discoveries. A compromise was proposed by [7], who described a
sequential procedure to control the false discovery rate (FDR), defined as
FDR = $E(V/R)$. The ratio $V/R$ is defined to be zero if there are no
rejections. The FDR as an error rate has many desirable properties. First of
all, as described in [7] and by many others, one can devise algorithms to
control the FDR in multiple testing situations under fairly general joint
behavior of the test statistics for the hypotheses. Secondly, if all
hypotheses are true, controlling the FDR is equivalent to controlling the
FWER. In general, the FDR falls between the other two error rates, the FWER
and the PCER (cf. [24]).

The Benjamini-Hochberg (B-H) FDR control procedure is a sequential
step-up procedure where the p-values (starting with the largest p-value)
are sequentially compared with a sequence of critical values to find a
critical p-value such that all hypotheses with p-values smaller than the
critical value are rejected. Suppose $k = \max\{i : X_{(i)} \le \alpha_i\}$,
where $\alpha_i = i\alpha/m$. Then the B-H procedure rejects all hypotheses
with p-values less than or equal to $X_{(k)}$. If no such $k$ exists, then
none of the hypotheses is rejected. Even though the algorithm sequentially
steps down through the sequence of p-values, it is called a step-up procedure
because this is equivalent to stepping up with respect to the associated
sequence of test statistics to find a minimal significant test value. The
procedure is also called a linear step-up procedure due to the linearity of
the critical function $\alpha_i$ with respect to $i$. [9], [46], [59] among
others have shown that the FDR associated with this particular step-up
procedure is exactly equal to $m_0\alpha/m$ in the case when the test
statistics are independent and is less than $m_0\alpha/m$ if the test
statistics have positive dependence: for every test function $\phi$, the
conditional expectation $E[\phi(X_1,\ldots,X_m)|X_i]$ is increasing with $X_i$
for each $i$. [46] has suggested an analogous step-down procedure where one
fails to reject all hypotheses with p-values above a critical value
$\alpha_i$, that is, if $l = \min\{i : X_{(i)} > \alpha_i\}$, none of the
hypotheses associated with p-values $X_{(l)}$ and above is rejected. [46] used
the same set of critical values $\alpha_i = i\alpha/m$ as in [7], which also
controls the FDR at the desired level (see [47]). However, for the step-down
procedure even in the independent case the actual FDR may be less than
$m_0\alpha/m$.
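The difference between the step-up and step-down rules is easy to see in code. The following is a minimal sketch with hypothetical function names; both functions use the linear critical values $\alpha_i = i\alpha/m$ and return a rejection indicator for each hypothesis in the original order.

```python
def bh_step_up(pvalues, alpha):
    """Linear step-up rule: reject all p-values up to X_(k),
    where k = max{i : X_(i) <= i * alpha / m}."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank                      # remember the largest rank meeting the bound
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

def step_down(pvalues, alpha):
    """Step-down analog: stop at the first rank l with X_(l) > l * alpha / m
    and reject only the hypotheses ranked before it."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] > rank * alpha / m:
            break
        rejected[idx] = True
    return rejected

pvals = [0.01, 0.039, 0.035, 0.005, 0.9]   # illustrative values only
up = bh_step_up(pvals, 0.05)
down = step_down(pvals, 0.05)
```

In this example the third smallest p-value (0.035) exceeds its critical value 0.03, so the step-down rule stops after two rejections, while the step-up rule continues to the fourth ordered p-value (0.039, below 0.04) and rejects four hypotheses.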

Since in the independent case the FDR of the linear step-up procedure
is exactly equal to $m_0\alpha/m$, if the proportion of true null hypotheses,
$\pi = m_0/m$, is known then $\alpha$ can be adjusted to get FDR equal to any
target level. Specifically, if $\alpha_i = i\alpha/(m\pi)$ then the FDR of the
linear step-up procedure is exactly equal to $\alpha$ in the independent case.
Unfortunately, in any realistic situation $m_0$ is not known. Thus, in
situations where $\pi$ is not very close to one, the FDR can be significantly
smaller than the desired level, and the procedure may be very conservative
with poor power properties.

Another set of sequential FDR controlling procedures was introduced
more recently, where $\pi$ is adaptively estimated from the data and the
critical values are modified as $\alpha_i = i\alpha/(m\hat\pi)$.
Heuristically, this procedure would yield an FDR close to
$\pi\alpha E(\hat\pi^{-1})$, and if $\hat\pi$ is an efficient estimator of
$\pi$ then the FDR for the adaptive procedure will be close to the target
level $\alpha$. However, merely plugging in an estimator of $\pi$ in the
expression for $\alpha_i$ may yield poor results due to the variability of the
estimator of $\pi^{-1}$. [57] suggested using
$\hat\pi = [m - R(\lambda) + 1]/[m(1-\lambda)]$, where
$R(\lambda) = \sum_i \mathbb{1}\{X_i \le \lambda\}$ is the number of p-values
smaller than $\lambda$ and $0 < \lambda < 1$ is a constant; here and below
$\mathbb{1}$ will stand for the indicator function. Similar estimators had
been originally suggested by [56]. Then for any $\lambda$, choose the sequence
of critical points as

$$\alpha_i = \min\left\{\lambda,\ \frac{i\alpha(1-\lambda)}{m - R(\lambda) + 1}\right\}.$$

The adaptive procedure generally yields tighter FDR control and hence can
enhance the power properties of the procedures significantly ([8], [12]). Of
course, the performance of the procedure will be a function of the choice of
$\lambda$. [58] suggested various procedures for choosing $\lambda$. [11]
suggested choosing $\lambda = \alpha/(1+\alpha)$ and they looked at the power
properties of the adaptive procedure. [50] investigated theoretical
properties of these two stage procedures and [22] suggested analogous
adaptive step-down procedures.
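The estimator of $\pi$ and the adaptive critical points above translate directly into a few lines of code. This is only a sketch; the function names and the p-values are hypothetical.

```python
def storey_pi_hat(pvalues, lam):
    """pi_hat = [m - R(lam) + 1] / [m * (1 - lam)], with R(lam) = #{X_i <= lam}."""
    m = len(pvalues)
    r_lam = sum(1 for p in pvalues if p <= lam)
    return (m - r_lam + 1) / (m * (1.0 - lam))

def adaptive_critical_values(pvalues, alpha, lam):
    """alpha_i = min{lam, i * alpha * (1 - lam) / (m - R(lam) + 1)}, i = 1, ..., m."""
    m = len(pvalues)
    r_lam = sum(1 for p in pvalues if p <= lam)
    return [min(lam, i * alpha * (1.0 - lam) / (m - r_lam + 1))
            for i in range(1, m + 1)]

pvals = [0.001, 0.004, 0.012, 0.03, 0.04, 0.2, 0.45, 0.6, 0.75, 0.95]
pi_hat = storey_pi_hat(pvals, lam=0.5)                     # 7 of 10 p-values are <= 0.5
crit = adaptive_critical_values(pvals, alpha=0.05, lam=0.5)
```

With these numbers $\hat\pi = 4/5$, so each critical value is $1/\hat\pi = 1.25$ times the corresponding B-H value $i\alpha/m$, as long as it stays below $\lambda$.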

The procedures described above for controlling the FDR can be thought of
as a fixed-error-rate approach, where the individual hypotheses are tested at
different significance levels to maintain a constant overall error rate.
[57, 58] introduced the fixed-rejection-region approach where
$\alpha_i = \alpha$ for all $i$ (i.e., the rejection region is fixed). The FDR
given the rejection region is estimated from the data and then $\alpha$ is
chosen to set the estimated FDR at a predetermined level. [57] also argued
that since one becomes concerned about false discoveries only in the
situation where there are some discoveries, one should look at the expected
proportion of false discoveries conditional on the fact that there have been
some discoveries. Thus the positive false discovery rate (pFDR) is defined as
pFDR = $E(V/R \mid R > 0)$. [57] showed that if we assume a mixture model for
the hypotheses, i.e., if we can assume that the true null hypotheses are
arising as a Bernoulli sequence with probability $\pi$, then the expression
for pFDR reduces to

$$\mathrm{pFDR}(\alpha) = \frac{\pi\alpha}{F(\alpha)}, \qquad (7.2)$$

where $F(\cdot)$ is the marginal c.d.f. of the p-values. Although it cannot be
controlled in the situation when there are no discoveries, given its simple
expression, pFDR is ideally suited for the estimation approach. Once an
estimator for pFDR has been obtained, the error control procedure reduces
to rejecting all p-values less than or equal to $\hat\gamma$, where

$$\hat\gamma = \max\{\gamma : \widehat{\mathrm{pFDR}}(\gamma) \le \alpha\}. \qquad (7.3)$$

Storey (cf. [57]) showed that the B-H linear step-up procedure can be viewed
as Storey’s procedure where $\pi$ is estimated by 1. Therefore, it is clear
that using the procedure (7.3) will improve the power substantially unless
$\pi$ is actually very close to 1.
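A plug-in version of (7.2) and (7.3) can be sketched as follows. The assumptions are ours: $F$ is replaced by the empirical c.d.f., the estimate of $\pi$ is supplied by the user, and the maximization in (7.3) is carried out over the observed p-values only, which leaves the set of rejected hypotheses unchanged because the empirical c.d.f. is constant between observed p-values. The function names are hypothetical.

```python
def pfdr_hat(gamma, pvalues, pi_hat):
    """Plug-in estimate pi_hat * gamma / F_m(gamma), F_m being the empirical c.d.f."""
    m = len(pvalues)
    f_hat = sum(1 for p in pvalues if p <= gamma) / m
    return pi_hat * gamma / f_hat if f_hat > 0 else float("inf")

def rejection_threshold(pvalues, alpha, pi_hat):
    """Largest observed p-value whose estimated pFDR is at most alpha (None if none)."""
    ok = [p for p in pvalues if pfdr_hat(p, pvalues, pi_hat) <= alpha]
    return max(ok) if ok else None

pvals = [0.005, 0.01, 0.035, 0.045, 0.9]                  # illustrative values only
gamma_bh = rejection_threshold(pvals, 0.05, pi_hat=1.0)   # pi estimated by 1
gamma_ad = rejection_threshold(pvals, 0.05, pi_hat=0.8)   # a smaller estimate of pi
```

With $\hat\pi = 1$ the rule reproduces the linear step-up decision and rejects the two smallest p-values; with $\hat\pi = 0.8$ the threshold moves up to 0.045 and four hypotheses are rejected, illustrating the gain in power.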

Storey (cf. [58]) also showed that the pFDR can be given a Bayesian
interpretation as the posterior probability of a null hypothesis being true
given that it has been rejected. This interpretation connects the
frequentist and the Bayesian paradigms in the multiple testing situation.
Given that p-values are fundamental quantities that can be interpreted in
both paradigms, this connection in the context of a procedure based on
p-values is illuminating. Several multiple testing procedures have resulted
by substituting different estimators of pFDR in (7.3). Most of these
procedures rely on the expression (7.2) and substitute the empirical c.d.f.
for $F(\alpha)$ in the denominator. These procedures mainly differ in the way
they estimate $\pi$. However, since $\pi\alpha$ is less than or equal to
$F(\alpha)$, there is always a risk of violating the inequality if one
estimates $F(\alpha)$ and $\pi$ independently. [60] suggested a nonparametric
Bayesian approach that simultaneously estimates $\pi\alpha$ and $F(\alpha)$
within a mixture model framework that naturally constrains the estimators to
maintain the relationship. This results in a more efficient estimator of
pFDR.

The case when the test statistics (equivalently, p-values) are dependent
is of course of great practical interest. A procedure that controls the FDR
under positive regression dependence was suggested in [9], where the B-H
critical values are replaced by
$\alpha_i = i\alpha/(m\sum_{j=1}^m j^{-1})$. The procedure is very
conservative because the critical values are significantly smaller than the
B-H critical values. [50] suggested an alternative set of critical values and
investigated the performance under some special dependence structures. [21]
and [17] suggested modeling the probit transforms of the p-values as jointly
normal to capture dependence among the p-values. A similar procedure to model
the joint behavior of the p-values was suggested by [44], who used a mixture
of skew-normal densities to incorporate dependence among the p-values. This
mixing distribution is then estimated using the nonparametric Bayesian
techniques described in Section 7.1.

Other error measures such as the local FDR ([17]) were introduced to
suit modern large dimensional datasets. While the FDR depends on the
tail probability of the marginal p-value distribution, $F(\alpha)$, the local
FDR depends on the marginal p-value density. Other forms of generalization
can be found in ([48], [49]) and the references therein. Almost all error
measures are functionals of the marginal p-value distribution, while few
have been analyzed under the possibility of dependence among the p-values.
A model based approach that estimates the components of the marginal
distribution of the p-values has the advantage that once accurate estimates
of the components of the marginal distribution are obtained, it is
possible to estimate several of these error measures and make a comparative
study. Bayesian methodologies in multiple testing were discussed in [13],
[60], [27] and [44]. [26] used a weighted p-value scheme that incorporates
prior information about the hypotheses in the FDR controlling procedure.
Empirical Bayes estimation of the FDR was discussed in [15].

A particularly attractive feature of the Bayesian approach in the
multiple testing situation is its ability to attach a posterior probability
to an individual null hypothesis being actually false. In particular, it is
easy to predict the false discovery proportion (FDP), $V/R$. Let
$I_i(\alpha) = \mathbb{1}\{X_i \le \alpha\}$ denote that the $i$th hypothesis
is rejected at a threshold level $\alpha$ and let $H_i$ be the indicator that
the $i$th alternative hypothesis is true. The FDP process evaluated at a
threshold $\alpha$ (cf. [25]) is defined by

$$\mathrm{FDP}(\alpha) = \frac{\sum_{i=1}^m I_i(\alpha)(1 - H_i)}{\sum_{i=1}^m I_i(\alpha) + \prod_{i=1}^m (1 - I_i(\alpha))}.$$

Assuming that $(H_i, I_i(\alpha))$, $i = 1,\ldots,m$, are exchangeable, [44]
showed that FDR$(\alpha) = \pi b(\alpha)P(\text{at least one rejection})$,
where $b(\alpha)$ is the expected value of a function of the indicator
functions. This implies that pFDR$(\alpha) = \pi b(\alpha)$, which reduces to
the old expression under independence. A similar expression was derived in
[9] and also in [47]. In particular, [47] showed that the quantity
$b(\alpha)/\alpha$ is the expectation of a jackknife estimator of
$E[(1 + R)^{-1}]$.

Thus the simple formula for pFDR as $\pi\alpha/F(\alpha)$ does not hold if the
p-values are dependent, but the FDP, with better conditional properties,
seems to be more relevant to a Bayesian. Estimating the pFDR will generally
involve computing high dimensional integrals, and hence will be difficult to
obtain in reasonable time, but predicting the FDP is considerably simpler.
Since the Bayesian methods are able to generate from the joint conditional
distribution of $(H_1,\ldots,H_m)$ given data, we can predict the FDP by
calculating its conditional expectation given data.
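The prediction step can be sketched as follows, assuming that posterior draws of the indicator vector $(H_1,\ldots,H_m)$ are already available from some sampler; the function names and the two draws below are hypothetical.

```python
def fdp(pvalues, h, alpha):
    """FDP(alpha) for indicators h, where h[i] = 1 means the i-th alternative is true."""
    rejected = [p <= alpha for p in pvalues]                 # I_i(alpha)
    false_rejections = sum(1 for r, hi in zip(rejected, h) if r and hi == 0)
    # sum_i I_i + prod_i (1 - I_i) is the number of rejections, or 1 if there are none.
    return false_rejections / max(sum(rejected), 1)

def predicted_fdp(pvalues, h_samples, alpha):
    """Monte Carlo estimate of E[FDP(alpha) | data] from posterior draws of (H_1..H_m)."""
    return sum(fdp(pvalues, h, alpha) for h in h_samples) / len(h_samples)

pvals = [0.01, 0.02, 0.6]
h_draws = [[1, 0, 0], [1, 1, 0]]          # two made-up posterior draws
estimate = predicted_fdp(pvals, h_draws, alpha=0.05)
```

Because the p-values are held fixed, only the numerator changes across draws, which is why this prediction is so much cheaper than evaluating the pFDR.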

The theoretical model for the null distribution of the p-values is
U[0, 1]. The theoretical null model may not be appropriate for the observed
p-values in many real-life applications due to composite null hypotheses,
complicated test statistics or dependence among the datasets used to test the
multiple hypotheses. For a single hypothesis, the uniform null model may be
approximately valid even for very complex hypothesis testing situations with
composite null and complicated test statistics; see [4]. However, as argued
by [16] and [6], if the multiple hypothesis tests are dependent then the
$m_0$ null p-values collectively can behave very differently from a
collection of independent uniform random variables. For example, the
histogram of the probit transformed null p-values may be significantly
skinnier than the standard normal, the theoretical null distribution of the
probit p-values. [16] showed that a small difference between the theoretical
null and an empirical null can have a significant impact on the conclusions
of an error control procedure. Fortunately, large scale multiple testing
situations provide one with the opportunity to empirically estimate the null
distribution using a mixture model framework. Thus, validity of the
theoretical null assumption can be tested from the data and if the observed
values show significant departure from the assumed model, then the error
control procedure may be built based on the empirical null distribution.


7.3. Bayesian Mixture Models for p-Values


As discussed in the previous section, p-values play an extremely important
role in controlling the error in a multiple hypothesis testing problem.
Therefore, it is a prudent strategy to base our Bayesian approach on
p-values as fundamental objects rather than as a product of some
classical testing procedure. Consider the estimation approach of Storey
([57, 58]) discussed in the previous section. Here the false indicator $H_i$
of the $i$th null hypothesis is assumed to arise through a random mechanism,
being distributed as independent Bernoulli variables with success
probability $1 - \pi$. Under this scenario, even though the original problem
of multiple testing belongs to the frequentist paradigm, the probabilities
that one would like to estimate are naturally interpretable in a Bayesian
framework. In particular, the pFDR function can be written in the form of a
posterior probability. There are other advantages of the Bayesian approach
too. Storey’s estimation method of $\pi$ is based on the implicit assumption
that the density of p-values $h$ under the alternative is concentrated near
zero, and hence almost every p-value over the chosen threshold $\lambda$ must
arise from null hypotheses. Strictly speaking, this is incorrect because
p-values bigger than $\lambda$ can occur under alternatives as well. This
bias can be addressed through elaborate modeling of the p-value density.
Further, it is unnatural to assume that the value of the alternative
distribution remains fixed when the hypotheses themselves are appearing
randomly. It is more natural to assume that, given that the alternative is
true, the value of the parameter under study is chosen randomly according to
some distribution. This additional level of hierarchy is easily absorbed in
the mixture model for the density of p-values proposed below.


7.3.1. Independent case: Beta mixture model for p-values 


In this subsection, we assume that the test statistics, and hence the
p-values, arising from different hypotheses are independent. Then the
p-values $X_1,\ldots,X_m$ may be viewed as i.i.d. samples from the two
component mixture model: $f(x) = \pi g(x) + (1-\pi)h(x)$, where $g$ stands
for the density of p-values under the null hypothesis and $h$ that under the
alternative. The distribution of $X_i$ under the corresponding null
hypothesis $H_{0i}$ may be assumed to be uniform on [0,1], at least
approximately. This happens under a number of scenarios:


(i) the test statistic is a continuous random variable and the null
hypothesis is simple;
(ii) in situations like the t-test or F-test, where the null hypothesis has
been reduced to a simple one by considerations of similarity or invariance;
(iii) if a conditional predictive p-value or a partial predictive p-value
([4], [41]) is used.


Thus, unless explicitly stated, hereafter we assume that g is the uniform 
density. It is possible that this assumption fails to hold, which will be 
evident from the departure of the empirical null distribution from the the- 
oretical null. However, even when this assumption fails to hold, generally 
the actual g is stochastically larger than the uniform. Therefore it can be 
argued that the error control procedures that assume the uniform density 
remain valid in the conservative sense. Alternatively, this difference can be 
incorporated in the mixture model by allowing the components of the mix- 
ture distribution that are stochastically larger than the uniform distribution 
to constitute the actual null distribution. 

The density of p-values under alternatives is not only concentrated near
zero, but usually has more features. In most multiple testing problems,
individual tests are usually simple one-sided or two-sided z-tests, t-tests,
or more generally, tests for parameters in a monotone likelihood ratio (MLR)
family. When the test is one-sided and the test statistic has the MLR
property, it is easy to see that the density of p-values is decreasing
(Proposition 1 of [27]). For two-sided alternatives, the null distribution of
the test statistic is often symmetric, and in that case, a two-sided analog of
the MLR property implies that the p-value density is decreasing
(Proposition 2 of [27]). The p-value density for a one-sided hypothesis
generally decays to zero as $x$ tends to 1. For a two-sided hypothesis, the
minimum value of the p-value density will be a (small) positive number. For
instance, for the two-sided normal location model, the minimum value is
$e^{-n\theta^2/2}$, where $n$ is the sample size on which the test is based
and $\theta$ is the value of the location parameter under the alternative. In
either case, the p-value density looks like a reflected “J”, a shape
exhibited by a beta density with parameters $a < 1$ and $b \ge 1$. In fact, if
we are testing for the scale parameter of the exponential distribution, it is
easy to see that the p-value density is exactly beta with $a < 1$ and $b = 1$.
In general, several distributions on [0,1] can be well approximated by
mixtures of beta distributions (see [14], [40]). Thus it is reasonable to
approximate the p-value density under the alternative by an arbitrary mixture
of beta densities with parameters $a < 1$ and $b \ge 1$, that is,
$h(x) = \int \mathrm{be}(x|a,b)\,dG(a,b)$, where
$\mathrm{be}(x|a,b) = x^{a-1}(1-x)^{b-1}/B(a,b)$ is the beta density with
parameters $a$ and $b$, and $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ is the
beta function. The mixing distribution can be regarded as a completely
arbitrary distribution subject to the only restriction that $G$ is
concentrated in $(0,1) \times [1,\infty)$. [60] took this approach and
considered a Dirichlet process prior on the mixing distribution $G$. Note
that, if the alternative values arise randomly from a population distribution
and individual p-value densities conditional on the alternative are well
approximated by mixtures of beta densities, then the beta mixture model
continues to approximate the overall p-value density. Thus, the mixture model
approach covers much wider models and has a distinct advantage over other
methods proposed in the literature. The resulting posterior can be computed
by an appropriate MCMC method, as described below. The resulting Bayesian
estimator, because of shrinkage properties, offers a reduction in the mean
squared error and is generally more stable than its empirical counterpart
considered by Storey ([57, 58]). [60] ran extensive simulations to
demonstrate the advantages of the Bayesian estimator.
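The shape of the beta mixture model can be explored with a short sketch. For illustration only, the mixing distribution $G$ is taken to be a finite discrete distribution given as a list of (weight, a, b) triples with $a < 1$ and $b \ge 1$; the function names and the parameter values are hypothetical.

```python
import math
import random

def beta_pdf(x, a, b):
    """be(x | a, b) = x^(a-1) (1-x)^(b-1) / B(a, b), for 0 < x < 1."""
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_beta_fn)

def pvalue_density(x, pi, components):
    """f(x) = pi + (1 - pi) * h(x), with h a finite beta mixture (uniform null assumed)."""
    h = sum(w * beta_pdf(x, a, b) for w, a, b in components)
    return pi + (1.0 - pi) * h

def sample_pvalues(m, pi, components, rng):
    """Draw m p-values from the two component mixture, with their null indicators."""
    weights = [w for w, _, _ in components]
    pvalues, is_null = [], []
    for _ in range(m):
        if rng.random() < pi:
            pvalues.append(rng.random())              # null: uniform on [0, 1)
            is_null.append(True)
        else:
            _, a, b = rng.choices(components, weights=weights)[0]
            pvalues.append(rng.betavariate(a, b))     # alternative: beta(a, b)
            is_null.append(False)
    return pvalues, is_null

components = [(0.6, 0.3, 2.0), (0.4, 0.7, 5.0)]   # made-up (weight, a, b) triples
rng = random.Random(3)
pvals, is_null = sample_pvalues(2000, 0.8, components, rng)
```

Since every component has $a < 1$ and $b \ge 1$, the mixture density is decreasing on (0,1) and stays above $\pi$, which is the reflected “J” shape described above.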

The DPM model is equivalent to the following hierarchical model, where
associated with each $X_i$ there is a latent variable $\theta_i = (a_i, b_i)$,
$X_i|\theta_i \sim \pi + (1-\pi)\mathrm{be}(x_i|\theta_i)$,
$\theta_1,\ldots,\theta_m | G \stackrel{iid}{\sim} G$ and $G \sim DP(M, G_0)$.
The random measure $G$ can be integrated out from the prior distribution
to work with only finitely many latent variables $\theta_1,\ldots,\theta_m$.
In application to beta mixtures, it is not possible to choose $G_0$ to be
conjugate with the beta likelihood. Therefore it is not possible to obtain
closed-form expressions for the weights and the baseline posterior
distribution in the generalized Polya urn scheme for sampling from the
posterior distribution of $(\theta_1,\ldots,\theta_m)$. To overcome this
difficulty, the no-gaps algorithm ([34]) may be used, which can bypass the
problems of evaluating the weights and sampling from the baseline posterior.
For other alternative MCMC schemes, consult [36].

[60] gave a detailed description of how the no-gaps algorithm can be
implemented to generate samples from the posterior of
$(\theta_1,\ldots,\theta_m,\pi)$. Once MCMC sample values of
$(\theta_1,\ldots,\theta_m,\pi)$ are obtained, the posterior mean of $\pi$ is
approximately given by the mean of the sampled $\pi$-values. Since the
pFDR functional is not linear in $(G,\pi)$, evaluation of the posterior mean
of pFDR$(\alpha)$ requires generating posterior samples of the infinite
dimensional parameter $h$ using Sethuraman’s representation of $G$. This is
not only cumbersome, but also requires truncating the infinite series to
finitely many terms and controlling the error resulting from the truncation.
We avoid this path by observing that, when $m$ is large (which is typical in
multiple testing applications), the “posterior distribution” of $G$ given
$\theta_1,\ldots,\theta_m$ is essentially concentrated at the “posterior
mean” of $G$ given $\theta_1,\ldots,\theta_m$, which is given by
$E(G|\theta_1,\ldots,\theta_m) = (M+m)^{-1}MG_0 + (M+m)^{-1}\sum_{i=1}^m \delta_{\theta_i}$,
where $\delta_\theta(x) = \mathbb{1}\{\theta \le x\}$ now stands for the
c.d.f. of the distribution degenerate at $\theta$. Thus the approximate
posterior mean of pFDR$(\alpha)$ can be obtained by averaging the values of
$\pi\alpha/[(M+m)^{-1}MG_0(\alpha) + (M+m)^{-1}\sum_{i=1}^m \delta_{\theta_i}(\alpha)]$
realized in the MCMC samples. In the simulations of [60], it turned out that
the sensitivity of the posterior to prior parameters is minimal.

In spite of the success of the no-gaps algorithm in computing the Bayes
estimators of $\pi$ and pFDR$(\alpha)$, the computing time is exorbitantly
high in large scale applications. In many applications, real-time computing
giving instantaneous results is essential. Newton’s algorithm ([38], [39],
[37]) is a computationally fast way of solving general deconvolution problems
in mixture models, but it can also be used to compute density estimates.

For a general kernel mixture, Newton's algorithm may be described as
follows: Assume that $Y_1,\ldots,Y_m \overset{iid}{\sim} h(y) = \int
k(y;\theta)\psi(\theta)\,d\nu(\theta)$, where the mixing density
$\psi(\theta)$ with respect to the dominating measure $\nu(\theta)$ is to be
estimated. Start with an initial estimate $\psi_0(\theta)$, such as the
prior mean, of $\psi(\theta)$. Fix weights $1 \ge w_1 \ge w_2 \ge \cdots \ge
w_m > 0$ such as $w_i = i^{-1}$. Recursively compute


$$\psi_i(\theta) = (1-w_i)\psi_{i-1}(\theta) + w_i\,
\frac{k(Y_i;\theta)\psi_{i-1}(\theta)}{\int k(Y_i;t)\psi_{i-1}(t)\,d\nu(t)},$$


and declare $\psi_m(\theta)$ as the final estimate $\hat\psi(\theta)$. The
estimate is not a Bayes estimate (it depends on the ordering of the
observations), but it closely mimics the Bayes estimate with respect to a
DPM prior with kernel $k(x;\theta)$ and center measure with density
$\psi_0(\theta)$. If $\sum_{i=1}^\infty w_i = \infty$ and
$\sum_{i=1}^\infty w_i^2 < \infty$, then the mixing density is consistently
estimated ([37], [28]).
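
As an illustration, the recursion can be sketched numerically by
discretising $\theta$ on a grid. The Python sketch below is not taken from
[37]-[39]; it assumes a normal location kernel with unit variance, a
Lebesgue dominating measure approximated by a Riemann sum, a flat initial
estimate, and the weights $w_i = 1/(i+1)$. All function and variable names
are hypothetical.

```python
import numpy as np

def newton_recursive_estimate(y, grid, kernel, psi0, weights):
    """Newton's recursive estimate of a mixing density on an equispaced grid."""
    dtheta = grid[1] - grid[0]
    # Normalise the initial estimate so that it integrates to one on the grid.
    psi = psi0 / (psi0.sum() * dtheta)
    for yi, wi in zip(y, weights):
        # One-observation "posterior" of theta given Y_i under the current psi.
        post = kernel(yi, grid) * psi
        post /= post.sum() * dtheta
        # Convex combination of the current estimate and that posterior.
        psi = (1.0 - wi) * psi + wi * post
    return psi

def normal_kernel(y, theta):
    # Assumed kernel k(y; theta): the N(theta, 1) density evaluated at y.
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
m = 500
theta_true = rng.choice([-2.0, 2.0], size=m)      # illustrative two-point mixing law
y = theta_true + rng.standard_normal(m)

grid = np.linspace(-6.0, 6.0, 601)
psi0 = np.ones_like(grid)                          # flat initial estimate
weights = 1.0 / (np.arange(1, m + 1) + 1.0)        # w_i = 1/(i+1)
psi_hat = newton_recursive_estimate(y, grid, normal_kernel, psi0, weights)
total_mass = psi_hat.sum() * (grid[1] - grid[0])
```

Each update is a convex combination of two densities normalised with the
same quadrature rule, so the estimate keeps total mass one. Permuting `y`
changes the result, which reflects the order dependence of the estimate.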

In the multiple testing context, $\nu$ is the sum of point mass at 0 of
size 1 and the Lebesgue measure on $(0,1)$. Then $\pi$ is identified as
$\psi(0)$ and $F(\alpha)$ as $\psi(0)\alpha + \int_{(0,\alpha]}
\psi(\theta)\,d\theta$. Then a reasonable estimate is obtained by
$\hat\psi(0)\alpha/[\hat\psi(0)\alpha + \int_{(0,\alpha]}
\hat\psi(\theta)\,d\theta]$. The computation is extremely fast and the
performance of the estimator is often comparable to that of the Bayes
estimator.

Since $\pi$ takes the most important role in the expression for the pFDR
function, it is important to estimate $\pi$ consistently. However, a
conceptual problem arises because $\pi$ is not uniquely identifiable from
the mixture representation $F(x) = \pi x + (1-\pi)H(x)$, where $H(\cdot)$ is
another c.d.f. on $[0,1]$. Note that the class of such distributions is
weakly closed. The components $\pi$ and $H$ can be identified by imposing
the additional condition that $H$ cannot be represented as a mixture with
another uniform component, which, for the case when $H$ has a continuous
density $h$, translates into $h(1) = 0$. Letting $\pi(F)$ be the largest
possible value of $\pi$ in the representation, it follows that $\pi(F)$
upper bounds the actual proportion of null hypotheses and hence the actual
pFDR is bounded by pFDR$(F;\alpha) := \pi(F)\alpha/F(\alpha)$. This serves
the purpose from a conservative point of view. The functional $\pi(F)$ and
the pFDR are upper semicontinuous with respect to the weak topology in the
sense that if $F_n \to_w F$, then $\limsup_{n\to\infty} \pi(F_n) \le \pi(F)$
and $\limsup_{n\to\infty}$ pFDR$(F_n;\alpha) \le$ pFDR$(F;\alpha)$.

Full identifiability of the components $\pi$ and $H$ in the mixture
representation is possible under further restriction on $F$ if $H(x)$ has a
continuous density $h$ with $h(1) = 0$ or the tail of $H$ at 1 is bounded by
$C(1-x)^{1+\epsilon}$ for some $C, \epsilon > 0$. The second option is
particularly attractive since it also yields continuity of the map taking
$F$ to $\pi$ under the weak topology. Thus posterior consistency of
estimating $F$ under the weak topology in this case will imply consistency
of estimating $\pi$ and the pFDR function, uniformly on compact subsets of
$(0,1]$. The class of distributions satisfying the latter condition will be
called $B$ and $D$ will stand for the class of continuous decreasing
densities on $(0,1]$.

Consider a prior $\Pi$ for $H$ supported in $B \cap D$ and independently a
prior $\mu$ for $\pi$ with full support on $[0,1]$. Let the true values of
$\pi$ and $h$ be respectively $\pi_0$ and $h_0$ where $0 < \pi_0 < 1$ and
$H_0 \in B \cap D$. In order to show posterior consistency under the weak
topology, we apply Schwartz's result [55]. Clearly we need the true p-value
density to be in the support of the beta mixture prior. A density $h$
happens to be a pointwise mixture of be$(a,b)$ with $a < 1$ and $b > 1$ if
$H(e^{-y})$ or $1 - H(1-e^{-y})$ is completely monotone, that is, has all
derivatives which are negative for odd orders and positive for even orders.
Since pointwise approximation is stronger than $L_1$-approximation by
Scheffe's theorem, densities pointwise approximated by beta densities are in
the $L_1$-support of the prior in the sense that $\Pi(h : \|h - h_0\|_1 <
\epsilon) > 0$ for all $\epsilon > 0$. Because both the true and the random
mixture densities contain a uniform component, both densities are bounded
below. Then a relatively simple analysis shows that the Kullback-Leibler
divergence is essentially bounded by the $L_1$-distance up to a logarithmic
term, and hence $f_0 = \pi_0 + (1-\pi_0)h_0$ is in the Kullback-Leibler
support of the prior on $f = \pi + (1-\pi)h$ induced by $\Pi$ and $\mu$.
Thus the consistency result discussed in Section 7.1 applies, so that the
posterior for $F$ is consistent under the weak topology. Hence under the
tail restriction on $H$ described above, posterior consistency for $\pi$
and pFDR follows. Even if the tail restriction does not hold, a one-sided
form of consistency, which may be called "upper semi-consistency", holds:
For any $\epsilon > 0$, $\Pr(\pi < \pi_0 + \epsilon | X_1,\ldots,X_m) \to 1$
a.s. and the posterior mean $\hat\pi_m$ satisfies $\limsup_{m\to\infty}
\hat\pi_m \le \pi_0$ a.s.

Unfortunately, the latter has limited significance since typically one
would not like to underestimate the true $\pi$ (and the pFDR) while
overestimation is less serious. When the beta mixture prior is used on $h$
with the center measure of the Dirichlet process $G_0$ supported in
$(0,1) \times (1+\epsilon,\infty)$ and $h_0$ is in the $L_1$-support of the
Dirichlet mixture prior, then full posterior consistency for estimating
$\pi$ and pFDR holds. Since the Kullback-Leibler property is preserved under
mixtures by Fubini's theorem, the result continues to hold even if the
precision parameter of the Dirichlet process is obtained from a prior and
the center measure $G_0$ contains hyperparameters.


7.3.2. Dependent case: Skew-normal mixture model for probit p-values


Due to the lack of a suitable multivariate model for the joint distribution 
of the p-values, most applications assume that the data associated with 
the family of tests are independent. However, empirical evidence obtained 
in many important applications such as fMRI, proteomics (two-dimensional
gel electrophoresis, mass spectroscopy) and microarray analysis, shows that
the data associated with the different tests for multiple hypotheses are more 
likely to be dependent. In an fMRI example, tests regarding the activa- 
tion of different voxels are spatially correlated. In diffusion tensor imaging 
problems, the diffusion directions are correlated and generate dependent 
observations over a spatial grid. Hence, a grid-by-grid comparison of such 
images across patient groups will generate several p-values that are highly 
dependent. 

The p-values, $X_i$, take values in the unit interval on which it is hard
to formulate a flexible multivariate model. It is advantageous to transform
$X_i$ to a real-valued random variable $Y_i$, through a strictly increasing
smooth mapping $\Psi : [0,1] \to \mathbb{R}$. A natural choice for $\Psi$ is
the probit link function, $\Phi^{-1}$, the quantile function of the standard
normal distribution. Let $Y_i = \Phi^{-1}(X_i)$ be referred to as the probit
p-values. We shall build flexible nonparametric mixture models for the joint
density of $(Y_1,\ldots,Y_m)$.
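
The probit transformation is easy to compute in practice. The short Python
sketch below uses the standard library class `statistics.NormalDist`; the
p-values are made up for illustration and the helper name `probit` is
hypothetical. Note that p-values exactly equal to 0 or 1 have no finite
probit value and would need separate handling.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def probit(p_values):
    """Map p-values in (0, 1) to the real line by the standard normal quantile."""
    return [std_normal.inv_cdf(x) for x in p_values]

p_values = [0.001, 0.025, 0.5, 0.975]      # illustrative values only
y = probit(p_values)

# Applying the standard normal c.d.f. recovers the original p-values.
back = [std_normal.cdf(v) for v in y]
```

Small p-values are mapped to large negative probit values, so evidence
against the null hypotheses appears in the left tail on the probit scale.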

The most obvious choice of a kernel is an m-variate normal density. 
Efron (cf. [17]) advocated in favor of this kernel. This can automatically 
include the null component, which is the standard normal density after 
the probit transformation of the uniform. However, the normal mixture 
has a shortcoming. As in the previous subsection, the marginal density of a
p-value is often decreasing. Thus the model on the probit p-values should
conform to this restriction whenever this is desired. The transformed
version of a normal mixture is not decreasing for any choice of the mixing
distribution unless all components have variance exactly equal to one. This
prompts a generalization of the normal kernel which still includes the
standard normal as a special case but can reproduce the decreasing shape of 
the p-value density by choosing the mixing distribution appropriately. [44] 
suggested using the multivariate skew-normal kernel as a generalization of 
the normal kernel. The mixture of skew-normal distribution does provide 
decreasing p-value densities for a large subset of parameter configurations. 

To understand the point, it is useful to look at the unidimensional case.
Let


$$q(y;\mu,\omega,\lambda) = 2\phi(y;\mu,\omega^2)\,\Phi(\lambda\omega^{-1}y)$$


denote the skew-normal density (cf. [2]) with location parameter $\mu$,
scale parameter $\omega$ and shape parameter $\lambda$, where
$\phi(y;\mu,\omega^2)$ denotes the $N(\mu,\omega^2)$ density and
$\Phi(\cdot)$ denotes the standard normal c.d.f. The skew-normal family has
received a lot of recent attention due to its ability to naturally
generalize the normal family to incorporate skewness and form a much more
flexible class. The skewness of the distribution is controlled by $\lambda$
and when $\lambda = 0$, it reduces to the normal distribution. If $Y$ has
density $q(y;\mu,\omega,\lambda)$, then [44] showed that the density of
$X = \Phi(Y)$ is decreasing in $0 < x < 1$ if and only if


$$\omega^2 \ge 1, \quad \lambda \ge \sqrt{(\omega^2-1)/\omega^2} \quad
\text{and} \quad \mu \le \lambda H^*(\beta_1(\omega^2,\lambda)),$$


where $\beta_1(\omega^2,\lambda) = (\omega^2-1)/(\lambda^2\omega^2)$,
$0 \le \beta_1 \le 1$, $H^*(\beta_1) = \inf_x [H(x) - \beta_1 x]$ and
$H(\cdot)$ is the hazard function of the standard normal distribution.
Now, since the class of decreasing densities forms a convex set, it follows
that the decreasing nature of the density of the original p-value $X$ will
be preserved even when a mixture of skew-normal densities
$q(y;\mu,\omega,\lambda)$ is considered, provided that the mixing measure
$K$ is supported on


$$\{(\mu,\omega,\lambda) : \mu \le \lambda H^*(\beta_1(\omega^2,\lambda)),
\ \omega \ge 1, \ \lambda \ge \sqrt{(\omega^2-1)/\omega^2}\}.$$


Location-shape mixtures of the skew-normal family holding the scale fixed
at $\omega = 1$ can be restricted to produce decreasing p-value densities if
the location parameter is negative and the shape parameter is positive. For
scale-shape mixtures with the location parameter set to zero, the induced
p-value densities are decreasing if the mixing measure has support on
$\{(\omega,\lambda) : \omega \ge 1, \lambda \ge \sqrt{1-\omega^{-2}}\}$.
Location-scale mixtures with the shape parameter set to zero are the same as
location-scale mixtures of the normal family. It is clear from the
characterization that the normal density is unable to keep the shape
restriction. This is the primary reason why we do not work with normal
mixtures.

By varying the location parameter $\mu$ and the scale parameter $\omega$ in
the mixture, we can generate all possible densities. The skew-normal kernel
automatically incorporates skewness even before taking mixtures, and hence
it is expected to lead to a parsimonious mixture representation in the
presence of skewness, commonly found in the target density. Therefore we can
treat the mixing measure $K$ to be a distribution on $\mu$ and $\omega$ only
and treat $\lambda$ as a hyperparameter. The nonparametric nature of $K$ can
be maintained by putting a prior with large weak support, such as the
Dirichlet process. A recent result of [61] shows that nonparametric Bayesian
density estimation based on a skew-normal kernel is consistent under the
weak topology, adding a strong justification for the use of this kernel.
Interestingly, if the theoretical standard normal null distribution is way
off from the empirical one, then one can incorporate this feature in the
model by allowing $K$ to assign weights to skew-normal components
stochastically larger than the standard normal.

In the multidimensional case, [44] suggested replacing the univariate
skew-normal kernel by a multivariate analog. [3] introduced the multivariate
skew-normal density


$$SN_m(y;\mu,\Omega,\alpha) = 2\phi_m(y;\mu,\Omega)\,
\Phi(\alpha^T\Delta^{-1}(y-\mu)),$$


where $\phi_m$ is the $m$-variate normal density. Somewhat more flexibility
in separating skewness and correlation is possible with the version of [45].
[44] considered a scale-shape mixture under restriction to illustrate the
capability of the skew-normal mixture model. Most commonly arising probit
p-value densities can be well approximated by such mixtures. Analogous
analysis is possible with mixtures of location, scale and shape. Consider
an $m \times m$ correlation matrix $R$ with possibly a very sparse
structure. Let $\omega = (\omega_1,\ldots,\omega_m)^T$,
$\alpha = (\alpha_1,\ldots,\alpha_m)^T$ and
$\lambda = (\lambda_1,\ldots,\lambda_m)^T$. Let $H_i$ denote the indicator
that the $i$th null hypothesis $H_{i0}$ is false and let
$H = (H_1,\ldots,H_m)^T$. Then a multivariate mixture model for
$Y = (Y_1,\ldots,Y_m)^T$ is $(Y|\omega,\lambda,H,R) \sim
SN_m(0,\Omega,\alpha)$ where $\Omega = \Delta_\omega R \Delta_\omega$,
$\Delta_\omega = \mathrm{diag}(\omega)$ is the diagonal matrix of scale
parameters and $\alpha = R^{-1}\lambda$ is the vector of shape parameters.
Let $H_i$ be i.i.d. Bernoulli$(1-\pi)$, and independently


$$(\omega_i,\lambda_i)|H \sim \begin{cases}
\delta_{(1,0)}, & \text{if } H_i = 0, \\
K, & \text{if } H_i = 1.
\end{cases}$$


The skew-mixture model is particularly suitable for Bayesian estimation.
[44] described an algorithm for obtaining posterior samples. Using a result
from [3], one can represent $Y_i = \omega_i\delta_i|U| +
\omega_i\sqrt{1-\delta_i^2}\,V_i$, where $\delta_i =
\lambda_i/\sqrt{1+\lambda_i^2}$, $U$ is standard normal and
$V = (V_1,\ldots,V_m)^T$ is distributed as $m$-variate normal with zero mean
and dispersion matrix $R$ independently of $U$. This representation
naturally lends itself to an iterative MCMC scheme. The posterior sample for
the parameters in $R$ can be used to validate the assumption of
independence. Also, using the posterior samples it is possible to predict
the FDP.
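
The stochastic representation can be checked by simulation. The Python
sketch below assumes the standard form of the representation in [3], with
$\delta_i = \lambda_i/\sqrt{1+\lambda_i^2}$, a single $|U|$ shared by all
coordinates, and $V \sim N_m(0,R)$. The function name, the parameter values
and the matrix `R` are illustrative assumptions and are not taken from [44].

```python
import numpy as np

def sample_skew_normal_vector(omega, lam, R, size, rng):
    """Draw vectors Y with Y_i = omega_i * (delta_i*|U| + sqrt(1-delta_i^2)*V_i)."""
    omega = np.asarray(omega, dtype=float)
    lam = np.asarray(lam, dtype=float)
    delta = lam / np.sqrt(1.0 + lam ** 2)
    # One half-normal variable |U| per draw, shared by all m coordinates.
    abs_u = np.abs(rng.standard_normal((size, 1)))
    # V is m-variate normal with zero mean and dispersion (correlation) matrix R.
    v = rng.multivariate_normal(np.zeros(omega.size), R, size=size)
    return omega * (delta * abs_u + np.sqrt(1.0 - delta ** 2) * v)

rng = np.random.default_rng(2)
omega = np.array([1.0, 1.5, 2.0])
lam = np.array([0.0, 2.0, 4.0])      # first coordinate: standard normal (null) case
R = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
y = sample_skew_normal_vector(omega, lam, R, size=100_000, rng=rng)

# Under this representation E(Y_i) = omega_i * delta_i * sqrt(2/pi).
delta = lam / np.sqrt(1.0 + lam ** 2)
expected_mean = omega * delta * np.sqrt(2.0 / np.pi)
```

A coordinate with $\omega_i = 1$ and $\lambda_i = 0$ reduces to $V_i$, a
standard normal variable, which corresponds to the null component on the
probit scale.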

It is not obvious how to formulate an analog of Newton's estimate for
dependent observations, but we outline the sketch of a strategy below. If
the joint density under the model can be factorized as
$Y_1|\theta \sim k_1(Y_1;\theta)$, $Y_2|(Y_1,\theta) \sim
k_2(Y_2;Y_1,\theta)$, $\ldots$, $Y_m|(Y_1,\ldots,Y_{m-1},\theta) \sim
k_m(Y_m;Y_1,\ldots,Y_{m-1},\theta)$, then the most natural extension would
be to use


$$\psi_i(\theta) = (1-w_i)\psi_{i-1}(\theta) + w_i\,
\frac{k_i(Y_i;Y_1,\ldots,Y_{i-1},\theta)\psi_{i-1}(\theta)}
{\int k_i(Y_i;Y_1,\ldots,Y_{i-1},t)\psi_{i-1}(t)\,d\nu(t)}. \qquad (7.4)$$


Such factorizations are often available if the observations arise
sequentially. On the other hand, if $m$ is small and $(Y_i|Y_j, j \ne i)$
are simple, we may use the kernel $k_i(y_i|\theta, y_j, j \ne i)$. More
generally, if the observations can be associated with a decomposable
graphical model, we can proceed by fixing a perfect order of cliques and
then reducing to the above two special cases through the decomposition.


7.4. Areas of Application 


Multiple testing procedures have gained increasing popularity in statisti- 
cal research in view of their wide applicability in biomedical applications. 
Microarray experiments epitomize the applicability of multiple testing pro- 
cedures because in microarray we are faced with a severe multiplicity prob- 
lem where the error rapidly accumulates as one tests for significance over 
thousands of gene locations. We illustrate this point using a dataset ob- 
tained from the National Center for Biotechnology Information (NCBI) 
database. The data comes from an analysis of isografted kidneys from 
brain dead donors. Brain death in donors triggers inflammatory events in 
recipients after kidney transplantation. Inbred male Lewis rats were used 
in the experiment as both donors and recipients, with the experimental 
group receiving kidneys from brain dead donors and the control group re- 
ceiving kidneys from living donors. Gene expression profiles of isografts 
from brain dead donors and grafts from living donors were compared us- 
ing a high-density oligonucleotide microarray that contained approximately 
25,000 genes. [6] analyzed this dataset using a finite skew-mixture model 
where the mixing measure is supported on only a finite set of parameter 
values. Due to the high multiplicity of the experiment, even for a
single-step procedure with a very small $\alpha$, the FDR can be quite
large. [6] estimated that the pFDR for testing for the difference between
brain dead donors and living donors at each of the 25,000 gene locations at
a fixed level $\alpha = 0.0075$ is about 0.2. The mixture model framework
also naturally provides estimates of effect size among the false nulls.
While [6] looked at one-sided t-tests at each location to generate the
p-values, they constructed the histogram of the 25,000 p-values generated
from two-sided tests. The left panel of Figure 7.1 gives a default MATLAB
kernel-smoother estimate
of the observed p-value histogram. The density shows a general decreas- 
ing shape except for local variation and at the edges. The spikes at the 
edges are artifacts of the smoothing mechanism. The bumpy nature of the 
smoothed histogram motivates a mixture approach to modeling. The his- 
togram in the probit scale is shown as the jagged line in the right panel in 
Figure 7.1. The smoothed curve is an estimate of the probit p-value density 
based on a skew-normal mixture model. Empirical investigation reveals the 
possibility of correlation among the gene locations. Thus, the multivariate 
skew-normal mixture would yield more realistic results by incorporating 
flexible dependence structure. 


Fig. 7.1. Density of p-values obtained from the rat data: original scale
(left) and probit scale (right).


Another important application area of the FDR control procedure is
fMRI. In fMRI data, one is interested in testing for brain activation in
thousands of brain voxels simultaneously. In a typical experiment designed
to determine the effect of a covariate (say a drug or a disease status) on
brain activation during a specific task (say eye movement), the available
subjects will be divided into the treatment group (individuals taking the
drug or having a particular disease) and the control group (individuals
taking a placebo or not having the disease) and their brain activation
(blood oxygen level dependent signal) will be recorded at each voxel in a
three dimensional grid in the brain. Then for each of the thousands of
voxels, the responses for the individuals in both groups are recorded and
two-sample tests are carried out voxel-by-voxel to determine the voxels
with significant signal difference across groups. However, due to severe
multiplicity, too many voxels may be declared as significant discoveries.
Many of these voxels can be adjudged unimportant based on physiological
knowledge, but still many others may remain as potential discoveries. The
left panel of Figure 7.2 shows the voxels discovered as significant in a
particular slice of the brain in a typical fMRI study (the details of the
study are not given due to confidentiality issues; the figure is just used
for illustration). The stand-alone voxels with differential activation are
potentially false discoveries, whereas the contiguous clusters of voxels
with significant activation pattern are potentially more meaningful
findings. However, one needs to use statistical procedures to determine
this, as there will be tens of thousands of voxels and determining the
validity of the findings manually is an infeasible task and a source of
potential subjective bias. FDR control has been advocated by [23] to
control for false discoveries in fMRI experiments. An application of the B-H
procedure removes most of the voxels as false discoveries while keeping only
a few with strong signal difference between the two groups. Thus the B-H
procedure for this application turns out to be very conservative, and
conflicts with the scientific goal of finding anatomically rich activation
patterns. An FDR control procedure that takes the dependence among voxels
into account will be more appropriate for this application. Work is underway
to evaluate the merits of the dependent skew-mixture procedure in a typical
fMRI dataset.


Fig. 7.2. fMRI slice activation image before and after FDR control.
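
For reference, the B-H step-up rule itself is simple to state in code. The
Python sketch below assumes the usual definition of the procedure: with
ordered p-values $p_{(1)} \le \cdots \le p_{(m)}$, reject the hypotheses
with the $k$ smallest p-values, where $k$ is the largest index such that
$p_{(k)} \le kq/m$. The function name, the symbol $q$ for the nominal FDR
level and the ten p-values are illustrative assumptions; they are not data
from the fMRI study.

```python
import numpy as np

def benjamini_hochberg(pvalues, q):
    """Return a boolean array marking the hypotheses rejected by the B-H rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices sorting p increasingly
    thresholds = q * np.arange(1, m + 1) / m     # k*q/m for k = 1, ..., m
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        k = below[-1]                            # largest (0-based) index passing
        reject[order[: k + 1]] = True            # reject the k+1 smallest p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)
n_rejected = int(rejected.sum())
```

The rule uses only the ordered p-values and takes no account of which
voxels are neighbours, which is the limitation discussed above for spatially
dependent tests.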

[6] also gave an illustration of the pitfalls of constraining the p-value
model to have a theoretical null component. In their example, the null
components were made up of two components, one which is slightly
stochastically smaller than the theoretical null and the other which is
slightly bigger. With a single theoretical null distribution fitted to the
data, both components were poorly estimated while the unconstrained fit with
no pre-specified theoretical null distribution gave an adequate
approximation of both components.

Of course the applicability of multiple testing procedures is not re- 
stricted to biomedical problems. While the biomedical problems have been 
the primary motivation for developing false discovery control procedures, 
FDR control procedures are equally important in other fields, such as as- 
tronomy, where one may be interested in testing significance of findings of 
several celestial bodies simultaneously. There are important applications 
in reliability, meteorology and other disciplines as well.


This survey covers state-of-the-art Bayesian techniques for the estimation
of mixtures. It complements the earlier work [31] by studying new types of
distributions, the multinomial, latent class and t distributions. It also
exhibits closed form solutions for Bayesian inference in some discrete
setups. Lastly, it sheds a new light on the computation of Bayes factors via
the approximation of [8].


8.1. Introduction 


Mixture models are fascinating objects in that, while based on elementary 
distributions, they offer a much wider range of modeling possibilities than 
their components. They also face both highly complex computational chal- 
lenges and delicate inferential derivations. Many statistical advances have 
stemmed from their study, the most spectacular example being the EM 
algorithm. In this short review, we choose to focus solely on the Bayesian 
approach to those models (cf. [42]). [20] provides a book-long and in-depth 
coverage of the Bayesian processing of mixtures, to which we refer the 
reader whose interest is woken by this short review, while [29] give a broader 
perspective. 


*Kate Lee is a PhD candidate at the Queensland University of Technology, Jean-Michel 
Marin is professor in Université Montpellier 2, Kerrie Mengersen is professor at the 
Queensland University of Technology, and Christian P. Robert is professor in Université 
Paris Dauphine and head of the Statistics Laboratory of CREST.

Without opening a new debate about the relevance of the Bayesian
approach in general, we note that the Bayesian paradigm (cf. [41]) allows for 
probability statements to be made directly about the unknown parameters 
of a mixture model, and for prior or expert opinion to be included in the 
analysis. In addition, the latent structure that facilitates the description of 
a mixture model can be naturally aggregated with the unknown parameters 
(even though latent variables are not parameters) and a global posterior 
distribution can be used to draw inference about both aspects at once. 

This survey thus aims to introduce the reader to the construction, prior 
modelling, estimation and evaluation of mixture distributions within a 
Bayesian paradigm. Focus is on both Bayesian inference and computa- 
tional techniques, with light shed on the implementation of the most com- 
mon samplers. We also show that exact inference (with no Monte Carlo 
approximation) is achievable in some particular settings and this leads to 
an interesting benchmark for testing computational methods. 

In Section 8.2, we introduce mixture models, including the missing data 
structure that originally appeared as an essential component of a Bayesian 
analysis, along with the precise derivation of the exact posterior distribution 
in the case of a mixture of Multinomial distributions. Section 8.3 points out 
the fundamental difficulty in conducting Bayesian inference with such ob- 
jects, along with a discussion about prior modelling. Section 8.4 describes 
the appropriate MCMC algorithms that can be used for the approxima- 
tion to the posterior distribution on mixture parameters, followed by an 
extension of this analysis in Section 8.5 to the case in which the number of 
components is unknown and may be derived from approximations to Bayes 
factors, including the technique of [8] and the robustification of [2]. 


8.2. Finite Mixtures 


8.2.1. Definition 


A mixture of distributions is defined as a convex combination


$$\sum_{j=1}^J p_j f_j(x), \qquad p_j > 0, \qquad \sum_{j=1}^J p_j = 1,$$


of standard distributions $f_j$. The $p_j$'s are called weights and are most
often unknown. In most cases, the interest is in having the $f_j$'s
parameterised, each with an unknown parameter $\theta_j$, leading to the
generic parametric mixture model


$$\sum_{j=1}^J p_j f(x|\theta_j). \qquad (8.1)$$


The dominating measure for (8.1) is arbitrary and therefore the nature
of the mixture observations widely varies. For instance, if the dominating
measure is the counting measure on the simplex of $\mathbb{R}^m$


$$S_{m,\ell} = \Big\{(x_1,\ldots,x_m);\ \sum_{i=1}^m x_i = \ell\Big\},$$


the $f_j$'s may be the product of $\ell$ independent Multinomial
distributions, denoted "$\mathcal{M}_m(\ell; q_{j1},\ldots,q_{jm}) =
\otimes_{i=1}^{\ell} \mathcal{M}_m(1; q_{j1},\ldots,q_{jm})$", with $m$
modalities, and the resulting mixture


$$\sum_{j=1}^J p_j\,\mathcal{M}_m(\ell; q_{j1},\ldots,q_{jm}) \qquad (8.2)$$


is then a possible model for repeated observations taking place in
$S_{m,\ell}$. Practical occurrences of such models are repeated observations
of contingency tables. In situations when contingency tables tend to vary
more than expected, a mixture of Multinomial distributions should be more
appropriate than a single Multinomial distribution and it may also
contribute to separation of the observed tables in homogeneous classes. In
the following, we note $q_{j\cdot} = (q_{j1},\ldots,q_{jm})$.


Example 8.1. For $J = 2$, $m = 4$, $p_1 = p_2 = .5$,
$q_{1\cdot} = (.2,.5,.2,.1)$, $q_{2\cdot} = (.3,.3,.1,.3)$ and $\ell = 20$,
we simulate $n = 50$ independent realisations from model (8.2). That
corresponds to simulating some $2 \times 2$ contingency tables whose total
sum is equal to 20. Figure 8.1 gives the histograms for the four entries of
the contingency tables. ◁
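
A simulation of this kind can be sketched in a few lines of Python. The
sketch assumes NumPy; the seed and the variable names are our own choices,
so the output will not reproduce the histograms of Figure 8.1 exactly. Each
table is generated by first drawing its component label and then drawing a
Multinomial vector from that component.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.5])                    # mixture weights p_1, p_2
q = np.array([[0.2, 0.5, 0.2, 0.1],         # cell probabilities of component 1
              [0.3, 0.3, 0.1, 0.3]])        # cell probabilities of component 2
ell, n = 20, 50                             # table total and number of tables

# Latent component label of each table (0-based), drawn with probabilities p.
z = rng.choice(len(p), size=n, p=p)

# Given its label, each table is one Multinomial(ell; q_j) draw with m = 4 cells.
tables = np.array([rng.multinomial(ell, q[j]) for j in z])

# Per-cell sample means, the quantities summarised by the four histograms.
cell_means = tables.mean(axis=0)
```

Reshaping a row of `tables` with `reshape(2, 2)` gives the corresponding
$2 \times 2$ contingency table.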


Fig. 8.1. For $J = 2$, $p_1 = p_2 = .5$, $q_{1\cdot} = (.2,.5,.2,.1)$,
$q_{2\cdot} = (.3,.3,.1,.3)$, $\ell = 20$ and $n = 50$ independent
simulations: histograms of the $m = 4$ entries.


Another case where mixtures of Multinomial distributions occur is the
latent class model where $d$ discrete variables are observed on each of $n$
individuals ([30]). The observations ($1 \le i \le n$) are
$x_i = (x_{i1},\ldots,x_{id})$, with $x_{iv}$ taking values within the $m_v$
modalities of the $v$-th variable. The distribution of $x_i$ is then


$$\sum_{j=1}^J p_j \prod_{v=1}^d \mathcal{M}_{m_v}\big(1;
q_1^{vj},\ldots,q_{m_v}^{vj}\big),$$


so, strictly speaking, this is a mixture of products of Multinomials. The
applications of this peculiar modelling are numerous: in medical studies, 
it can be used to associate several symptoms or pathologies; in genetics, it 
may indicate that the genes corresponding to the variables are not sufficient 
to explain the outcome under study and that an additional (unobserved) 
gene may be influential. Lastly, in marketing, variables may correspond to 
categories of products, modalities to brands, and components of the mixture 
to different consumer behaviours: identifying to which group a customer 
belongs may help in suggesting sales, as on Web-sale sites. 

Similarly, if the dominating measure is the counting measure on the set
of integers $\mathbb{N}$, the $f_j$'s may be Poisson distributions
$\mathcal{P}(\lambda_j)$ ($\lambda_j > 0$). We aim then to make inference
about the parameters $(p_j, \lambda_j)$ from a sequence
$(x_i)_{i=1,\ldots,n}$ of integers.

The dominating measure may as well be the Lebesgue measure on
$\mathbb{R}$, in which case the $f(x|\theta)$'s may all be normal
distributions or Student's t distributions (or even a mix of both), with
$\theta$ representing the unknown mean and variance, or the unknown mean and
variance and degrees of freedom, respectively. Such a model is appropriate
for datasets presenting multimodal or asymmetric features, like the aerosol
dataset from [38] presented below.

Example 8.2. The estimation of particle size distribution is important in
understanding the aerosol dynamics that govern aerosol formation, which is
of interest in environmental and health modelling. One of the most important
physical properties of aerosol particles is their size; the concentration
of aerosol particles in terms of their size is referred to as the particle
size distribution.

The data studied by [38] and represented in Figure 8.2 is from Hyytiälä,
a measurement station in Southern Finland. It corresponds to a full day of
measurement, taken at ten minute intervals. ◁


Fig. 8.2. Histogram of the aerosol diameter dataset (horizontal axis:
diameter in nm), along with a normal (red) and a t (blue) modelling.


While the definition (8.1) of a mixture model is elementary, its
simplicity does not extend to the derivation of either the maximum
likelihood estimator (when it exists) or of Bayes estimators. In fact, if we
take $n$ iid observations $x = (x_1,\ldots,x_n)$ from (8.1), with parameters


$$p = (p_1,\ldots,p_J) \quad \text{and} \quad
\theta = (\theta_1,\ldots,\theta_J),$$


the full computation of the posterior distribution and in particular the
explicit representation of the corresponding posterior expectation involves
the expansion of the likelihood


$$L(\theta,p|x) = \prod_{i=1}^n \sum_{j=1}^J p_j f(x_i|\theta_j)
\qquad (8.3)$$


into a sum of $J^n$ terms, with some exceptions (see, for example, Section
8.3). This is thus computationally too expensive to be used for more than
a few observations. This fundamental computational difficulty in dealing
with the models (8.1) explains why those models have often been at the
forefront for applying new technologies (such as MCMC algorithms, see
Section 8.4).


8.2.2. Missing data 


Mixtures of distributions are typical examples of latent variable (or
missing data) models in that a sample $x_1,\ldots,x_n$ from (8.1) can be
seen as a collection of subsamples originating from each of the
$f(x_i|\theta_j)$'s, when both the size and the origin of each subsample may
be unknown. Thus, each of the $x_i$'s in the sample is a priori distributed
from any of the $f_j$'s with probabilities $p_j$. Depending on the setting,
the inferential goal behind this modeling may be to reconstitute the
original homogeneous subsamples, sometimes called clusters, or to provide
estimates of the parameters of the different components, or even to estimate
the number of components.

The missing data representation of a mixture distribution can be
exploited as a technical device to facilitate (numerical) estimation. By a
demarginalisation argument, it is always possible to associate to a random
variable $x_i$ from a mixture (8.1) a second (finite) random variable $z_i$
such that


$$x_i|z_i = z \sim f(x|\theta_z), \qquad P(z_i = j) = p_j. \qquad (8.4)$$


This auxiliary variable $z_i$ identifies to which component the observation
$x_i$ belongs. Depending on the focus of inference, the $z_i$'s may (or may
not) be part of the quantities to be estimated. In any case, keeping in mind
the availability of such variables helps in drawing inference about the
"true" parameters. This is the technique behind the EM algorithm of [11]
as well as the "data augmentation" algorithm of [51] that started MCMC
algorithms.


8.2.3. The necessary but costly expansion of the likelihood


As noted above, the likelihood function (8.3) involves $J^n$ terms when the
$n$ inner sums are expanded, that is, when all the possible values of the
missing variables $z_i$ are taken into account. While the likelihood at a
given value $(\theta,p)$ can be computed in $O(nJ)$ operations, the
computational difficulty in using the expanded version of (8.3) precludes
analytic solutions via maximum likelihood or Bayesian inference. Considering
$n$ iid observations from model (8.1), if $\pi(\theta,p)$ denotes the prior
distribution on $(\theta,p)$, the posterior distribution is naturally given
by


$$\pi(\theta,p|x) \propto \Big(\prod_{i=1}^n \sum_{j=1}^J
p_j f(x_i|\theta_j)\Big)\,\pi(\theta,p).$$


It can therefore be computed in $O(nJ)$ operations up to the normalising
(marginal) constant, but, similar to the likelihood, it does not provide an
intuitive distribution unless expanded.
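
The contrast between the two forms of the likelihood can be made concrete
on a toy example. The Python sketch below evaluates (8.3) once as a product
of $n$ sums and once as a sum over all $J^n$ allocation vectors. It assumes
normal components with unit variance and made-up data and parameter values;
the function names are hypothetical.

```python
import itertools
import math

def normal_pdf(x, mean, sd=1.0):
    # Density of N(mean, sd^2) at x.
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_product_form(x, p, theta):
    """Product over observations of the inner sums: O(nJ) operations."""
    value = 1.0
    for xi in x:
        value *= sum(pj * normal_pdf(xi, tj) for pj, tj in zip(p, theta))
    return value

def likelihood_expanded(x, p, theta):
    """Sum over all J^n allocation vectors z: feasible only for tiny n."""
    total = 0.0
    for z in itertools.product(range(len(p)), repeat=len(x)):
        term = 1.0
        for xi, zi in zip(x, z):
            term *= p[zi] * normal_pdf(xi, theta[zi])
        total += term
    return total

x = [-1.2, 0.3, 2.1, 1.7, -0.4, 2.6]   # illustrative data, n = 6
p = [0.4, 0.6]                          # illustrative weights, J = 2
theta = [0.0, 2.0]                      # illustrative component means

fast = likelihood_product_form(x, p, theta)
slow = likelihood_expanded(x, p, theta)
n_terms = len(p) ** len(x)              # J^n terms in the expanded sum
```

Both functions return the same value up to rounding error, but the expanded
form already visits 64 allocations for six observations and two components.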

Relying on the auxiliary variables $z = (z_1,\ldots,z_n)$ defined in (8.4), we 
take $\mathcal{Z}$ to be the set of all $J^n$ allocation vectors $z$. For a given vector 
$(n_1,\ldots,n_J)$ of the simplex $\{n_1+\ldots+n_J = n\}$, we define a subset of $\mathcal{Z}$, 

$$\mathcal{Z}_i = \left\{ z : \sum_{\ell=1}^{n} \mathbb{I}_{z_\ell = 1} = n_1, \ldots, \sum_{\ell=1}^{n} \mathbb{I}_{z_\ell = J} = n_J \right\},$$

that consists of all allocations $z$ with the given allocation sizes $(n_1,\ldots,n_J)$, 
relabelled by $i \in \mathbb{N}$ when using for instance the lexicographical ordering on 
$(n_1,\ldots,n_J)$. The number of nonnegative integer solutions to the 
decomposition of $n$ into $J$ parts such that $n_1+\ldots+n_J = n$ is equal to (see [17]) 

$$r = \binom{n+J-1}{n}.$$

Thus, we have the partition $\mathcal{Z} = \cup_{i=1}^{r} \mathcal{Z}_i$. Although the total number of 
elements of $\mathcal{Z}$ is the typically unmanageable $J^n$, the number of partition 
sets is much more manageable since it is of order $n^{J-1}/(J-1)!$. It is thus 
possible to envisage an exhaustive exploration of the $\mathcal{Z}_i$'s. ([6] did take 
advantage of this decomposition to propose a more efficient importance 
sampling approximation to the posterior distribution.) 
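
To give a sense of the gap between the number of allocation vectors and the number of partition sets, the counts can be computed directly. This is a small sketch; the helper names are assumptions, and the values of $n$ and $J$ are chosen only for illustration.

```python
from math import comb, factorial

def number_of_partition_sets(n, J):
    """Number of vectors (n_1, ..., n_J) of nonnegative integers summing to n."""
    return comb(n + J - 1, n)

def compositions(n, J):
    """Enumerate these vectors in lexicographical order."""
    if J == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, J - 1):
            yield (first,) + rest

n, J = 10, 3
r = number_of_partition_sets(n, J)               # 66 allocation-size vectors
listed = list(compositions(n, J))                # explicit enumeration of the same vectors
n_allocations = J ** n                           # 59049 allocation vectors z
leading_order = n ** (J - 1) / factorial(J - 1)  # order of magnitude n^(J-1)/(J-1)!
```

For $n = 10$ and $J = 3$ there are 66 partition sets against 59,049 allocation vectors, which is why an exhaustive exploration of the sets (but not of the allocations) can be envisaged.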

The posterior distribution can then be written as 

$$\pi(\theta, p \mid x) = \sum_{i=1}^{r} \sum_{z \in \mathcal{Z}_i} \omega(z)\, \pi(\theta, p \mid x, z), \tag{8.5}$$

where $\omega(z)$ represents the posterior probability of the given allocation $z$. 
(See Section 8.2.4 for a derivation of $\omega(z)$.) Note that with this 
representation, a Bayes estimator of $(\theta,p)$ can be written as 

$$\sum_{i=1}^{r} \sum_{z \in \mathcal{Z}_i} \omega(z)\, \mathbb{E}^{\pi}[\theta, p \mid x, z]. \tag{8.6}$$

This decomposition makes a lot of sense from an inferential point of view: 
the Bayes posterior distribution simply considers each possible allocation 
$z$ of the dataset, allocates a posterior probability $\omega(z)$ to this allocation, 
and then constructs a posterior distribution $\pi(\theta, p \mid x, z)$ for the parameters 
conditional on this allocation. Unfortunately, the computational burden is 
of order $O(J^n)$. This is even more frustrating when considering that the 
overwhelming majority of the posterior probabilities $\omega(z)$ will be close to 
zero for any sample. 


8.2.4. Exact posterior computation 


In a somewhat paradoxical twist, we now proceed to show that, in some 
very special cases, there exist exact derivations for the posterior 
distribution! This surprising phenomenon only takes place for discrete 
distributions under a particular choice of the component densities 
$f(x \mid \theta_i)$. In essence, the $f(x \mid \theta_i)$'s must belong to the natural 
exponential families, i.e. 

$$f(x \mid \theta_i) = h(x) \exp\{\theta_i \cdot R(x) - \Psi(\theta_i)\},$$

to allow for sufficient statistics to be used. In this case, there exists a 
conjugate prior ([41]) associated with each $\theta$ in $f(x \mid \theta)$ as well as for the 
weights of the mixture. Let us consider the complete likelihood 

$$\begin{aligned}
L^c(\theta, p \mid x, z) &= \prod_{i=1}^{n} p_{z_i} \exp\{\theta_{z_i} \cdot R(x_i) - \Psi(\theta_{z_i})\} \\
&= \prod_{j=1}^{J} p_j^{n_j} \exp\Big\{\theta_j \cdot \sum_{z_i=j} R(x_i) - n_j \Psi(\theta_j)\Big\} \\
&= \prod_{j=1}^{J} p_j^{n_j} \exp\{\theta_j \cdot S_j - n_j \Psi(\theta_j)\},
\end{aligned}$$

where $S_j = \sum_{z_i=j} R(x_i)$. It is easily seen that we remain in an 
exponential family since there exist sufficient statistics with fixed dimension, 
$(n_1,\ldots,n_J,S_1,\ldots,S_J)$. Using a Dirichlet prior 

$$\pi(p_1,\ldots,p_J) = \frac{\Gamma(\alpha_1+\ldots+\alpha_J)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_J)}\; p_1^{\alpha_1-1} \cdots p_J^{\alpha_J-1}$$

on the vector of the weights $(p_1,\ldots,p_J)$ defined on the simplex of $\mathbb{R}^J$ and 
(independent) conjugate priors on the $\theta_j$'s, 

$$\pi(\theta_j) \propto \exp\{\theta_j \cdot \tau_j - \delta_j \Psi(\theta_j)\},$$

the posterior associated with the complete likelihood $L^c(\theta, p \mid x, z)$ is then 
of the same family as the prior: 

$$\begin{aligned}
\pi(\theta, p \mid x, z) &\propto \pi(\theta, p) \times L^c(\theta, p \mid x, z) \\
&\propto \prod_{j=1}^{J} p_j^{\alpha_j-1} \exp\{\theta_j \cdot \tau_j - \delta_j \Psi(\theta_j)\} \times p_j^{n_j} \exp\{\theta_j \cdot S_j - n_j \Psi(\theta_j)\} \\
&= \prod_{j=1}^{J} p_j^{\alpha_j+n_j-1} \exp\{\theta_j \cdot (\tau_j + S_j) - (\delta_j + n_j) \Psi(\theta_j)\};
\end{aligned}$$

the parameters of the prior get transformed from $\alpha_j$ to $\alpha_j + n_j$, from $\tau_j$ to 
$\tau_j + S_j$ and from $\delta_j$ to $\delta_j + n_j$. 

If we now consider the observed likelihood (instead of the complete 
likelihood), it is the sum of the complete likelihoods over all possible 
configurations of the partition space of allocations, that is, a sum over $J^n$ 
terms, 

$$\sum_{z} \prod_{j=1}^{J} p_j^{n_j} \exp\{\theta_j \cdot S_j - n_j \Psi(\theta_j)\}.$$

The associated posterior is then, up to a constant, 

$$\sum_{z} \prod_{j=1}^{J} p_j^{n_j+\alpha_j-1} \exp\{\theta_j \cdot (\tau_j + S_j) - (n_j + \delta_j) \Psi(\theta_j)\} = \sum_{z} \omega(z)\, \pi(\theta, p \mid x, z),$$

where $\omega(z)$ is the normalising constant that is missing in 

$$\prod_{j=1}^{J} p_j^{n_j+\alpha_j-1} \exp\{\theta_j \cdot (\tau_j + S_j) - (n_j + \delta_j) \Psi(\theta_j)\}.$$

The weight $\omega(z)$ is therefore 

$$\omega(z) \propto \frac{\prod_{j=1}^{J} \Gamma(n_j + \alpha_j)}{\Gamma\big(\sum_{j=1}^{J} \{n_j + \alpha_j\}\big)} \times \prod_{j=1}^{J} K(\tau_j + S_j, n_j + \delta_j),$$

if $K(\tau,\delta)$ is the normalising constant of $\exp\{\theta_j \cdot \tau - \delta \Psi(\theta_j)\}$, i.e. 

$$K(\tau,\delta) = \int \exp\{\theta_j \cdot \tau - \delta \Psi(\theta_j)\}\, d\theta_j.$$

Unfortunately, except for very few cases, like Poisson and Multinomial 
mixtures, this sum does not simplify into a smaller number of terms because 
there exist no summary statistics. From a Bayesian point of view, the 
complexity of the model is therefore truly of magnitude $O(J^n)$. 

We treat here the cases of both the Poisson and Multinomial mixtures, 
noting that the former case was previously exhibited by [16]. 


Example 8.3. Consider the case of a two component Poisson mixture, 

$$x_1,\ldots,x_n \overset{\text{iid}}{\sim} p\,\mathcal{P}(\lambda_1) + (1-p)\,\mathcal{P}(\lambda_2),$$

with a uniform prior on $p$ and exponential priors $\mathcal{E}xp(\tau_1)$ and $\mathcal{E}xp(\tau_2)$ on $\lambda_1$ 
and $\lambda_2$, respectively. For such a model, $S_j = \sum_{z_i=j} x_i$ and, with the 
natural parameter $\theta_j = \log\lambda_j$ and $\Psi(\theta_j) = \exp(\theta_j)$, the normalising 
constant is then equal to 

$$K(\tau,\delta) = \int_{-\infty}^{\infty} \exp\{\theta_j \tau - \delta e^{\theta_j}\}\, d\theta_j = \int_{0}^{\infty} \lambda_j^{\tau-1} \exp\{-\delta\lambda_j\}\, d\lambda_j = \delta^{-\tau}\,\Gamma(\tau).$$

The corresponding posterior is (up to the overall normalisation of the 
weights) 

$$\sum_{z} \prod_{j=1}^{2} \frac{\Gamma(n_j+1)\,\Gamma(1+S_j)}{(\tau_j+n_j)^{S_j+1}}\; \pi(\lambda, p \mid x, z),$$

where $\pi(\lambda, p \mid x, z)$ corresponds to a $\mathcal{B}(1+n_j, 1+n-n_j)$ (Beta distribution) on 
$p_j$ and to a $\mathcal{G}(S_j+1, \tau_j+n_j)$ (Gamma distribution) on $\lambda_j$, $(j = 1,2)$. 

An important feature of this example is that the above sum does not 
involve all of the $2^n$ terms, simply because the individual terms factorise in 
$(n_1, n_2, S_1, S_2)$ that act like local sufficient statistics. Since $n_2 = n - n_1$ 
and $S_2 = \sum x_i - S_1$, the posterior only requires as many distinct terms 
as there are distinct values of the pair $(n_1,S_1)$ in the completed sample. 
For instance, if the sample is $(0,0,0,1,2,2,4)$, the distinct values 
of the pair $(n_1,S_1)$ are $(0,0)$, $(1,0)$, $(1,1)$, $(1,2)$, $(1,4)$, $(2,0)$, $(2,1)$, $(2,2)$, 
$(2,3)$, $(2,4)$, $(2,5)$, $(2,6)$, ..., $(6,5)$, $(6,7)$, $(6,8)$, $(6,9)$, $(7,9)$. Hence there 
are 42 distinct terms in the posterior, rather than $2^7 = 128$. ◁ 
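
The counting argument of Example 8.3 can be checked by a simple book-keeping recursion that adds one observation at a time and records how many allocations lead to each value of $(n_1, S_1)$. The sketch below is one possible implementation for $J = 2$; the function name is hypothetical and it is not claimed to reproduce the exact algorithm of [16]. For the sample $(0,0,0,1,2,2,4)$ it finds 42 distinct pairs, including $(6,9)$, among the $2^7 = 128$ allocations.

```python
from collections import Counter

def count_sufficient_statistics(x):
    """Count, for a two-component mixture, how many allocations z share each (n_1, S_1).

    Observations are added one at a time: each either joins component 1,
    which moves (n_1, S_1) to (n_1 + 1, S_1 + x_i), or joins component 2,
    which leaves the pair unchanged.
    """
    counts = Counter({(0, 0): 1})
    for xi in x:
        updated = Counter()
        for (n1, s1), mult in counts.items():
            updated[(n1 + 1, s1 + xi)] += mult   # x_i allocated to component 1
            updated[(n1, s1)] += mult            # x_i allocated to component 2
        counts = updated
    return counts

sample = [0, 0, 0, 1, 2, 2, 4]
mu = count_sufficient_statistics(sample)
n_distinct = len(mu)              # number of distinct pairs (n_1, S_1)
n_allocations = sum(mu.values())  # recovers 2**n
```

The dictionary `mu` plays the role of the multiplicities attached to each value of the sufficient statistic, so that a sum over allocations can be replaced by a much shorter weighted sum over its keys.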


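Let $\mathbf{n} = (n_1,\ldots,n_J)$ and $\mathbf{S} = (S_1,\ldots,S_J)$. The problem of computing 
the number (or cardinal) $\mu_n(\mathbf{n},\mathbf{S})$ of terms in the sum with an identical 
statistic $(\mathbf{n},\mathbf{S})$ has been tackled by [16], who proposes a recurrent formula 
to compute $\mu_n(\mathbf{n},\mathbf{S})$ in an efficient book-keeping technique, as expressed 
below for a $J$ component mixture: 

If $e_j$ denotes the vector of length $J$ made of zeros everywhere except at 
component $j$ where it is equal to one, if 

$$\mathbf{n} = (n_1,\ldots,n_J) \quad\text{and}\quad \mathbf{n} - e_j = (n_1,\ldots,n_j - 1,\ldots,n_J),$$

then 

$$\mu_1(e_j, R(x_1)\,e_j) = 1 \quad (j = 1,\ldots,J), \qquad \mu_n(\mathbf{n}, \mathbf{S}) = \sum_{j=1}^{J} \mu_{n-1}(\mathbf{n} - e_j,\; \mathbf{S} - R(x_n)\,e_j).$$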

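Example 8.4. Once the $\mu_n(\mathbf{n},\mathbf{S})$'s are all recursively computed, the 
posterior can be written as 

$$\sum_{(\mathbf{n},\mathbf{S})} \mu_n(\mathbf{n},\mathbf{S}) \prod_{j=1}^{2} \frac{n_j!\, S_j!}{(\tau_j+n_j)^{S_j+1}}\; \pi(\theta, p \mid x, \mathbf{n}, \mathbf{S}),$$

up to a constant, and the sum only depends on the possible values of the 
“sufficient” statistic $(\mathbf{n},\mathbf{S})$. 

This closed form expression allows for a straightforward representation 
of the marginals. For instance, up to a constant, the marginal in $\lambda_1$ is given 
by 

$$\begin{aligned}
&\sum_{z} \prod_{j=1}^{2} \frac{n_j!\, S_j!}{(\tau_j+n_j)^{S_j+1}}\; (n_1+\tau_1)^{S_1+1} \lambda_1^{S_1} \exp\{-(n_1+\tau_1)\lambda_1\} \big/ S_1! \\
&\quad= \sum_{(\mathbf{n},\mathbf{S})} \mu_n(\mathbf{n},\mathbf{S}) \prod_{j=1}^{2} \frac{n_j!\, S_j!}{(\tau_j+n_j)^{S_j+1}} \times (n_1+\tau_1)^{S_1+1} \lambda_1^{S_1} \exp\{-(n_1+\tau_1)\lambda_1\} \big/ S_1!,
\end{aligned}$$

where the last factor is the $\mathcal{G}(S_1+1, \tau_1+n_1)$ density. The marginal in $\lambda_2$ is 

$$\sum_{(\mathbf{n},\mathbf{S})} \mu_n(\mathbf{n},\mathbf{S}) \prod_{j=1}^{2} \frac{n_j!\, S_j!}{(\tau_j+n_j)^{S_j+1}} \times (n_2+\tau_2)^{S_2+1} \lambda_2^{S_2} \exp\{-(n_2+\tau_2)\lambda_2\} \big/ S_2!,$$

again up to a constant. 

Another interesting outcome of this closed form representation is that 
marginal densities can also be computed in closed form. The marginal 
distribution of $x$ is directly related to the unnormalised weights in that 

$$m(x) = \sum_{z} \omega(z) = \sum_{(\mathbf{n},\mathbf{S})} \mu_n(\mathbf{n},\mathbf{S})\, \frac{\prod_{j=1}^{2} n_j!\, S_j! \big/ (\tau_j+n_j)^{S_j+1}}{(n+1)!}$$

up to the product of factorials $1/x_1!\cdots x_n!$ (but this product is irrelevant in 
the computation of the Bayes factor). ◁ 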


Now, even with this considerable reduction in the complexity of the 
posterior distribution, the number of terms in the posterior still explodes 
fast both with $n$ and with the number of components $J$, as shown through 
a few simulated examples in Table 8.1. The computational pressure also 
increases with the range of the data, that is, for a given value of $(J,n)$, 
the number of values of the sufficient statistics is much larger when the 
observations are larger, as shown for instance in the first three rows of 
Table 8.1: a simulated Poisson $\mathcal{P}(\lambda)$ sample of size 10 is mostly made of 
0's when $\lambda = .1$ but mostly takes different values when $\lambda = 10$. The 
impact on the number of sufficient statistics can be easily assessed when 
$J = 4$. (Note that the simulated dataset corresponding to $(n, \lambda) = (10, .1)$ 
in Table 8.1 happens to correspond to a simulated sample made only of 0's, 
which explains the $n + 1 = 11$ values of the sufficient statistic $(n_1,S_1) = 
(n_1,0)$ when $J = 2$.) 

Table 8.1. Number of pairs $(\mathbf{n},\mathbf{S})$ for simulated datasets from a Poisson 
$\mathcal{P}(\lambda)$ and different numbers of components. (Missing terms are due to 
excessive computational or storage requirements.) 

(n, λ)       J = 2      J = 3      J = 4 
(10, .1)        11         66        286 
(10, 1)         52        885      8,160 
(10, 10)       166      7,077    120,908 
(20, .1)        57        231      1,771 
(20, 1)        260     20,607    566,512 
(20, 10)       565    100,713          — 
(30, .1)        87      4,060     81,000 
(30, 1)        520     82,758          — 
(30, 10)     1,413    637,020          — 

Example 8.5. If we have $n$ observations $\mathbf{n}_i = (n_{i1},\ldots,n_{ik})$ from the 
Multinomial mixture 

$$\mathbf{n}_i \sim p\,\mathcal{M}_k(d_i; q_{11},\ldots,q_{1k}) + (1-p)\,\mathcal{M}_k(d_i; q_{21},\ldots,q_{2k})$$

where $n_{i1}+\cdots+n_{ik} = d_i$ and $q_{11}+\cdots+q_{1k} = q_{21}+\cdots+q_{2k} = 1$, the 
conjugate priors on the $q_{jv}$'s are Dirichlet distributions, $(j = 1,2)$ 

$$(q_{j1},\ldots,q_{jk}) \sim \mathcal{D}(\alpha_{j1},\ldots,\alpha_{jk}),$$

and we use once again the uniform prior on $p$. (A default choice for the 
$\alpha_{jv}$'s is $\alpha_{jv} = 1/2$.) Note that the $d_i$'s may differ from observation to 
observation, since they are irrelevant for the posterior distribution: given a 
partition $z$ of the sample, the complete posterior is indeed 

$$p^{n_1}(1-p)^{n_2} \prod_{j=1}^{2} \prod_{z_i=j} q_{j1}^{n_{i1}} \cdots q_{jk}^{n_{ik}} \times \prod_{j=1}^{2} \prod_{v=1}^{k} q_{jv}^{-1/2},$$

(where $n_j$ is the number of observations allocated to component $j$) up to a 
normalising constant that does not depend on $z$. ◁ 


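More generally, considering a Multinomial mixture with $m$ components, 

$$\mathbf{n}_i \sim \sum_{j=1}^{m} p_j\,\mathcal{M}_k(d_i; q_{j1},\ldots,q_{jk}),$$

the complete posterior is also directly available, as 

$$\prod_{j=1}^{m} p_j^{n_j} \times \prod_{j=1}^{m} \prod_{z_i=j} q_{j1}^{n_{i1}} \cdots q_{jk}^{n_{ik}} \times \prod_{j=1}^{m} \prod_{v=1}^{k} q_{jv}^{-1/2},$$

once more up to a normalising constant. 
Since the corresponding normalising constant of the Dirichlet distribution 
is 

$$\frac{\prod_{v=1}^{k} \Gamma(\alpha_{jv})}{\Gamma(\alpha_{j1}+\cdots+\alpha_{jk})},$$

the overall weight of a given partition $z$ (in the two component case) is 

$$n_1!\, n_2!\; \frac{\prod_{v=1}^{k} \Gamma(\alpha_{1v}+S_{1v})}{\Gamma(\alpha_{11}+\cdots+\alpha_{1k}+S_{1\cdot})} \times \frac{\prod_{v=1}^{k} \Gamma(\alpha_{2v}+S_{2v})}{\Gamma(\alpha_{21}+\cdots+\alpha_{2k}+S_{2\cdot})}, \tag{8.7}$$

where $S_{jv}$ is the sum of the $n_{iv}$'s for the observations $i$ allocated to 
component $j$ and $S_{j\cdot} = \sum_{v} S_{jv}$. 
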

Given that the posterior distribution only depends on those “sufficient” 
statistics $S_{jv}$ and $n_j$, the same factorisation as in the Poisson case applies, 
namely we simply need to count the number of occurrences of a particular 
local sufficient statistic $(n_1, S_{11},\ldots,S_{2k})$ and then sum over all values of 
this sufficient statistic. The book-keeping algorithm of [16] applies. Note 
however that the number of different terms in the closed form expression is 
growing extremely fast with the number of observations, with the number 
of components and with the number $k$ of modalities. 
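
Since the weight of a partition in a two-component Multinomial mixture only involves Gamma functions of the allocation sizes and category totals, it is conveniently evaluated on the log scale. The sketch below assumes the product-of-Gamma structure of (8.7); the function name `log_weight`, the toy counts and the hyperparameter values are illustrative assumptions.

```python
from math import lgamma

def log_weight(z, counts, alpha):
    """Unnormalised log-weight of an allocation z for a two-component
    Multinomial mixture, following the structure of (8.7).

    z      : list of labels in {0, 1}, one per observation
    counts : list of count vectors n_i = (n_i1, ..., n_ik)
    alpha  : alpha[j][v], Dirichlet hyperparameters of component j
    """
    k = len(counts[0])
    out = 0.0
    for j in (0, 1):
        members = [c for c, zi in zip(counts, z) if zi == j]
        n_j = len(members)
        # S_jv: category totals over the observations allocated to component j
        S_j = [sum(c[v] for c in members) for v in range(k)]
        out += lgamma(n_j + 1)                                   # log n_j!
        out += sum(lgamma(alpha[j][v] + S_j[v]) for v in range(k))
        out -= lgamma(sum(alpha[j]) + sum(S_j))
    return out

# Hypothetical data: four observations over k = 3 categories
counts = [(3, 0, 1), (0, 2, 2), (4, 1, 0), (0, 3, 1)]
alpha = [[0.5] * 3, [0.5] * 3]
w_a = log_weight([0, 1, 0, 1], counts, alpha)
w_b = log_weight([1, 0, 1, 0], counts, alpha)   # same partition, labels swapped
w_c = log_weight([0, 0, 1, 1], counts, alpha)
```

With identical hyperparameters on both components, swapping the labels leaves the weight unchanged (`w_a` equals `w_b`), while the partition grouping similar count vectors together (`w_a`) receives a larger weight than one mixing them (`w_c`).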


Example 8.6. In the case of the latent class model, consider the simplest 
case of two variables with two modalities each, so observations are products 
of Bernoulli's, 

$$x \sim p\,\mathcal{B}(q_{11})\mathcal{B}(q_{12}) + (1-p)\,\mathcal{B}(q_{21})\mathcal{B}(q_{22}).$$

We note that the corresponding statistical model is not identifiable beyond 
the usual label switching issue detailed in Section 8.3.1. Indeed, there 
are only two dichotomous variables, four possible realizations for the $x$'s, 
and five unknown parameters. We however take advantage of this artificial 
model to highlight the implementation of the above exact algorithm, 
which can then easily uncover the unidentifiability features of the posterior 
distribution. 

The complete posterior distribution is the sum over all partitions of the 
terms 

$$p^{n_1}(1-p)^{n_2} \prod_{j=1}^{2} \prod_{v=1}^{2} q_{jv}^{s_{jv}} (1-q_{jv})^{n_j - s_{jv}} \times \prod_{j=1}^{2} \prod_{v=1}^{2} \{q_{jv}(1-q_{jv})\}^{-1/2}$$

where $s_{jv} = \sum_{z_i=j} x_{iv}$; the sufficient statistic is thus $(n_1, s_{11}, s_{12}, s_{21}, s_{22})$, 
of order $O(n^5)$. Using the benchmark data of [50], made of 216 sample 
points involving four binary variables related with a sociological 
questionnaire, we restricted ourselves to the first two variables and 50 
observations picked at random. A recursive algorithm that eliminated 
replicates gives the results that (a) there are 5,928 different values for the 
sufficient statistic and (b) the most common occurrence is the middle 
partition $(26, 6, 11, 5, 10)$, with $7.16 \times 10^{12}$ replicas (out of $1.12 \times 10^{15}$ 
total partitions). The posterior weight of a given partition is 

$$\begin{aligned}
&\frac{\Gamma(n_1+1)\,\Gamma(n-n_1+1)}{\Gamma(n+2)} \prod_{j=1}^{2} \prod_{v=1}^{2} \frac{\Gamma(s_{jv}+1/2)\,\Gamma(n_j-s_{jv}+1/2)}{\Gamma(n_j+1)} \\
&\quad= \prod_{j=1}^{2} \prod_{v=1}^{2} \Gamma(s_{jv}+1/2)\,\Gamma(n_j-s_{jv}+1/2) \Big/ n_1!\,(n-n_1)!\,(n+1)!,
\end{aligned}$$

multiplied by the number of occurrences. In this case, it is therefore possible 
to find exactly the most likely partitions, namely the one with $n_1 = 11$ and 
$n_2 = 39$, $s_{11} = 11$, $s_{12} = 8$, $s_{21} = 0$, $s_{22} = 17$, and the symmetric one, 
which both only occur once and which have a joint posterior probability 
of 0.018. It is also possible to eliminate all the partitions with very low 
probabilities in this example. ◁ 


8.3. Mixture Inference 


Once again, the apparent simplicity of the mixture density should not be 
taken at face value for inferential purposes; since, for a sample of arbitrary 
size $n$ from a mixture distribution (8.1), there always is a non-zero 
probability $(1 - p_j)^n$ that the $j$th subsample is empty, the likelihood includes 
terms that do not bring any information about the parameters of the $j$-th 
component. 


8.3.1. Nonidentifiability, hence label switching 


A mixture model (8.1) is sensu stricto never identifiable since it is invariant 
under permutations of the indices of the components. Indeed, unless we 
introduce some restriction on the range of the $\theta_j$'s, we cannot distinguish 
component number 1 (i.e., $\theta_1$) from component number 2 (i.e., $\theta_2$) in the 
likelihood, because they are exchangeable. This apparently benign feature 
has consequences on both Bayesian inference and computational 
implementation. First, exchangeability implies that in a $J$ component mixture, 
the number of modes is of order $O(J!)$. The highly multimodal posterior 
surface is therefore difficult to explore via standard Markov chain Monte Carlo 
techniques. Second, if an exchangeable prior is used on $\theta = (\theta_1,\ldots,\theta_J)$, 
all the marginals of the $\theta_j$'s are identical. Other and more severe sources 
of unidentifiability could occur as in Example 8.6. 


Example 8.7. (Example 8.6 continued) If we continue our assessment 
of the latent class model, with two variables with two modalities each, 
based on a subset of data extracted from [50], under a Beta, $\mathcal{B}(a,b)$, prior 
distribution on $p$ the posterior distribution is the weighted sum of Beta 
$\mathcal{B}(n_1 + a, n - n_1 + b)$ distributions, with weights 

$$\mu_n(\mathbf{n},\mathbf{s}) \prod_{j=1}^{2} \prod_{v=1}^{2} \Gamma(s_{jv}+1/2)\,\Gamma(n_j-s_{jv}+1/2) \Big/ n_1!\,(n-n_1)!\,(n+1)!,$$

where $\mu_n(\mathbf{n},\mathbf{s})$ denotes the number of occurrences of the sufficient statistic. 
Figure 8.3 provides the posterior distribution for a subsample of the dataset 
of [50] and $a = b = 1$. Since $p$ is not identifiable, the impact of the prior 
distribution is stronger than in an identifying setting: using a Beta $\mathcal{B}(a,b)$ 
prior on $p$ thus produces a posterior [distribution] that reflects as much the 
influence of $(a,b)$ as the information contained in the data. While a $\mathcal{B}(1,1)$ 
prior, as in Figure 8.3, leads to a perfectly symmetric posterior with three 
modes, using an asymmetric prior with $a < b$ strongly modifies the range 
of the posterior, as illustrated by Figure 8.4. ◁ 


Identifiability problems resulting from the exchangeability issue are 
called “label switching” in that the output of a properly converging MCMC 
algorithm should produce no information about the component labels (a 
feature which, incidentally, provides a fast assessment of the performance 
of MCMC solutions, as proposed in [7]). A naive answer to the problem 
proposed in the early literature is to impose an identifiability constraint on 
the parameters, for instance by ordering the means (or the variances or the 
weights) in a normal mixture. From a Bayesian point of view, this amounts 
to truncating the original prior distribution, going from $\pi(\theta, p)$ to, for 
instance, 

$$\pi(\theta, p)\, \mathbb{I}_{\mu_1 \le \ldots \le \mu_J}.$$


While this device may seem innocuous (because indeed the sampling 
distribution is the same with or without this constraint on the parameter 
space), it is not without consequences on the resulting inference. This can 
be seen directly on the posterior surface: if the parameter space is reduced 
to its constrained part, there is no agreement between the above notation 
and the topology of this surface. Therefore, rather than selecting a single 
posterior mode and its neighbourhood, the constrained parameter space 
will most likely include parts of several modal regions. Thus, the resulting 
posterior mean may well end up in a very low probability region and be 
unrepresentative of the estimated distribution. 

Fig. 8.3. Exact posterior distribution of $p$ for a sample of 50 observations from the 
dataset of [50] and $a = b = 1$. 

Note that, once an MCMC sample has been simulated from an uncon- 
strained posterior distribution, any ordering constraint can be imposed on 
this sample, that is, after the simulations have been completed, for es- 
timation purposes as stressed by [48]. Therefore, the simulation (if not 
the estimation) hindrance created by the constraint can be completely by- 
passed. 

Once an MCMC sample has been simulated from an unconstrained posterior 
distribution, a natural solution is to identify one of the $J!$ modal regions 
of the posterior distribution and to operate the relabelling in terms of 
proximity to this region, as in [31]. Similar approaches based on clustering 
algorithms for the parameter sample are proposed in [48] and [7], and they 
achieve some measure of success on the examples for which they have been 
tested. 

Fig. 8.4. Exact posterior distributions of $p$ for a sample of 50 observations from 
the dataset of [50] under Beta $\mathcal{B}(a,b)$ priors when a = .01,.05,.1,.05,1 and b = 
100, 50, 20, 10,5, 1. 


An alternative approach is to eliminate the label switching problem by 
removing the labels altogether. This is done for instance in [7] by defining 
a loss function for the pairwise allocation of observations to clusters. From 
another perspective, [33] propose to work directly on the clusters associated 
with a mixture by defining the problem as an exchangeable process on the 
clusters: all that matters is then how data points are grouped together and 
this is indeed label-free. 


8.3.2. Restrictions on priors 


From a Bayesian point of view, the fact that few or no observations in 
the sample are (or may be) generated from a given component has a direct and 
important drawback: this prohibits the use of independent improper priors, 

$$\pi(\theta) = \prod_{j=1}^{J} \pi(\theta_j),$$

since, if 

$$\int \pi(\theta_j)\, d\theta_j = \infty$$

then for any sample size $n$ and any sample $x$, 

$$\int \pi(\theta, p \mid x)\, d\theta\, dp = \infty.$$

The ban on using improper priors can be considered by some as being 
of little importance, since proper priors with large variances could be used 
instead. However, since mixtures are ill-posed problems, this difficulty with 
improper priors is more of an issue, given that the influence of a particular 
proper prior, no matter how large its variance, cannot be truly assessed. 

There exists, nonetheless, a possibility of using improper priors in this 
setting, as demonstrated for instance by [34], by adding some degree of de- 
pendence between the component parameters. In fact, a Bayesian perspec- 
tive makes it quite easy to argue against independence in mixture models, 
since the components are only properly defined in terms of one another. For 
the very reason that exchangeable priors lead to identical marginal poste- 
riors on all components, the relevant priors must contain some degree of 
information that components are different and those priors must be explicit 
about this difference. 

The proposal of [34], also described in [31], is to introduce first a common 
reference, namely a scale, location, or location-scale parameter $(\mu,\tau)$, and 
then to define the original parameters in terms of departure from those 
references. Under some conditions on the reparameterisation, expressed 
in [43], this representation allows for the use of an improper prior on the 
reference parameter $(\mu,\tau)$. See [35, 39, 53] for different approaches to the 
use of default or non-informative priors in the setting of mixtures. 


8.4. Inference for Mixtures with a Known Number of 
Components 


In this section, we describe different Monte Carlo algorithms that are cus- 
tomarily used for the approximation of posterior distributions in mixture 
settings when the number of components J is known. We start in Section 
8.4.1 with a proposed solution to the label-switching problem and then 
discuss in the following sections Gibbs sampling and Metropolis-Hastings 
algorithms, acknowledging that a diversity of other algorithms exist (tem- 
pering, population Monte Carlo...), see [42]. 


8.4.1. Reordering 


Section 8.3.1 discussed the drawbacks of imposing identifiability ordering 
constraints on the parameter space for estimation performances and there 
are similar drawbacks on the computational side, since those constraints 
decrease the explorative abilities of a sampler and, in the most extreme 
cases, may even prevent the sampler from converging ([7]). We thus consider 
samplers that evolve in an unconstrained parameter space, with the specific 
feature that the posterior surface has a number of modes that is a multiple 
of $J!$. Assuming that this surface is properly visited by the sampler (and 
this is not a trivial assumption), the derivation of point estimates of the 
parameters of (8.1) follows from an ex-post reordering proposed by [31] 
which we describe below. 

Given a simulated sample of size $M$, a starting value for a point estimate 
is the naive approximation to the Maximum a Posteriori (MAP) estimator, 
that is the value in the sequence $(\theta, p)^{(l)}$ that maximises the posterior, 

$$l^{*} = \arg\max_{l=1,\ldots,M} \pi\big((\theta,p)^{(l)} \mid x\big).$$

Once an approximated MAP is computed, it is then possible to reorder all 
terms in the sequence $(\theta, p)^{(l)}$ by selecting the reordering that is the closest 
to the approximate MAP estimator for a specific distance in the parameter 
space. This solution bypasses the identifiability problem without requiring 
a preliminary and most likely unnatural ordering with respect to one of 
the parameters (mean, weight, variance) of the model. Then, after the 
reordering step, an estimation of $\theta_j$ is given by 

$$\sum_{l=1}^{M} (\theta_j)^{(l)} \big/ M.$$
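
The ex-post reordering can be sketched as follows: pick the visited draw with the highest posterior value as a pivot, permute the components of every other draw so as to be as close as possible to the pivot, and average. This is only an illustration under assumptions: the squared Euclidean distance, the function names and the toy draws are hypothetical, and the exhaustive search over the $J!$ permutations is only practical for small $J$.

```python
from itertools import permutations

def relabel_to_map(draws, log_post):
    """Relabel each MCMC draw to the permutation closest to the approximate MAP.

    draws    : list of draws; each draw is a list of J component tuples,
               e.g. (weight, mean) for component j
    log_post : log posterior density (up to a constant) of each draw
    """
    # naive MAP approximation: the visited draw with the highest posterior value
    l_star = max(range(len(draws)), key=lambda l: log_post[l])
    pivot = draws[l_star]

    def dist2(a, b):
        # squared Euclidean distance; the choice of distance is an assumption
        return sum((u - v) ** 2 for ca, cb in zip(a, b) for u, v in zip(ca, cb))

    relabelled = []
    for draw in draws:
        best = min(permutations(draw), key=lambda perm: dist2(perm, pivot))
        relabelled.append(list(best))
    return relabelled

def component_means(relabelled):
    """Average the relabelled draws component by component."""
    M, J, d = len(relabelled), len(relabelled[0]), len(relabelled[0][0])
    return [[sum(draw[j][c] for draw in relabelled) / M for c in range(d)]
            for j in range(J)]

# Hypothetical draws (weight, mean) with one label switch in the second draw
draws = [[(0.3, 0.1), (0.7, 5.0)],
         [(0.7, 4.9), (0.3, -0.1)],
         [(0.3, 0.0), (0.7, 5.1)]]
log_post = [-10.2, -10.5, -10.0]
fixed = relabel_to_map(draws, log_post)
estimate = component_means(fixed)
```

In this toy example the second draw is switched back before averaging, so the component averages are not contaminated by the label switch.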


8.4.2. Data augmentation and Gibbs sampling 
approximations 


The Gibbs sampler is the most commonly used approach in Bayesian mixture 
estimation ([12, 13, 15, 27, 52]) because it takes advantage of the 
missing data structure of the $z_i$'s uncovered in Section 8.2.2. 

The Gibbs sampler for mixture models (8.1) (cf. [13]) is based on the 
successive simulation of $z$, $p$ and $\theta$ conditional on one another and on the 
data, using the full conditional distributions derived from the conjugate 
structure of the complete model. (Note that $p$ only depends on the missing 
data $z$.) 


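Gibbs sampling for mixture models 


0. Initialization: choose $p^{(0)}$ and $\theta^{(0)}$ arbitrarily 
1. Step t. For $t = 1,\ldots$ 

   1.1 Generate $z_i^{(t)}$ $(i = 1,\ldots,n)$ from $(j = 1,\ldots,J)$ 
       $$P\big(z_i^{(t)} = j \mid p_j^{(t-1)}, \theta_j^{(t-1)}, x_i\big) \propto p_j^{(t-1)} f\big(x_i \mid \theta_j^{(t-1)}\big)$$ 
   1.2 Generate $p^{(t)}$ from $\pi(p \mid z^{(t)})$ 
   1.3 Generate $\theta^{(t)}$ from $\pi(\theta \mid z^{(t)}, x)$. 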


As always with mixtures, the convergence of this MCMC algorithm is 
not as easy to assess as it seems at first sight. In fact, while the chain is 
uniformly geometrically ergodic from a theoretical point of view, the severe 
augmentation in the dimension of the chain brought by the completion 
stage may induce strong convergence problems. The very nature of Gibbs 
sampling may lead to “trapping states”, that is, concentrated local modes 
that require an enormous number of iterations to escape from. For example, 
components with a small number of allocated observations and very small 
variance become so tightly concentrated that there is very little probability 
of moving observations in or out of those components, as shown in [31]. As 
discussed in Section 8.3.1, [7] show that most MCMC samplers for mixtures, 
including the Gibbs sampler, fail to reproduce the permutation invariance of 
the posterior distribution, that is, that they do not visit the J! replications 
of a given mode. 


Example 8.8. Consider a mixture of normal distributions with common 
variance $\sigma^2$ and unknown means and weights 

$$\sum_{j=1}^{J} p_j\, \mathcal{N}(\mu_j, \sigma^2).$$

This model is a particular case of model (8.1) and is not identifiable. Using 
conjugate exchangeable priors 

$$p \sim \mathcal{D}(1,\ldots,1), \qquad \mu_j \sim \mathcal{N}(0, 10\sigma^2), \qquad \sigma^{-2} \sim \mathcal{E}xp(1/2),$$

it is straightforward to implement the above Gibbs sampler: 

• the weight vector $p$ is simulated as the Dirichlet variable 

  $$\mathcal{D}(1+n_1,\ldots,1+n_J);$$

• the inverse variance as the Gamma variable 

  $$\mathcal{G}\left( (n+2)/2,\; (1/2)\left[ 1 + \sum_{j=1}^{J} \left( \frac{0.1\, n_j \bar{x}_j^2}{n_j + 0.1} + s_j^2 \right) \right] \right);$$

• and, conditionally on $\sigma$, the means $\mu_j$ are simulated as the Gaussian 
  variable 

  $$\mathcal{N}\big( n_j \bar{x}_j / (n_j + 0.1),\; \sigma^2 / (n_j + 0.1) \big);$$

where $n_j = \sum_{z_i=j} 1$, $\bar{x}_j = \sum_{z_i=j} x_i / n_j$ and $s_j^2 = \sum_{z_i=j} (x_i - \bar{x}_j)^2$. 
Note that this choice of implementation allows for the block simulation of 
the means-variance group, rather than the more standard simulation of the 
means conditional on the variance and of the variance conditional on the 
means ([13]). 
Consider the benchmark dataset of the galaxy radial speeds described 
for instance in [44]. The output of the Gibbs sampler is summarised on 
Figure 8.5 in the case of $J = 3$ components. As is obvious from the 
comparison of the three first histograms (and of the three following ones), label 
switching does not occur with this sampler: the three components remain 
isolated during the simulation process. ◁ 
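
A Gibbs sampler of this kind fits in a few lines of code. The sketch below assumes the priors $p \sim \mathcal{D}(1,\ldots,1)$, $\mu_j \sim \mathcal{N}(0, 10\sigma^2)$ and $\sigma^{-2} \sim \mathcal{E}xp(1/2)$ and the corresponding block update (inverse variance with the means integrated out, then the means given $\sigma$). The function name, the starting values and the simulated data are assumptions; the galaxy dataset itself is not reproduced here.

```python
import math
import random

def gibbs_normal_mixture(x, J, n_iter, rng):
    """Gibbs sampler sketch for sum_j p_j N(mu_j, sigma^2) with the priors
    p ~ D(1,...,1), mu_j ~ N(0, 10 sigma^2), sigma^-2 ~ Exp(1/2)."""
    n = len(x)
    p = [1.0 / J] * J
    mu = rng.sample(x, J)          # arbitrary starting means (an assumption)
    sigma2 = 1.0
    chain = []
    for _ in range(n_iter):
        # allocations: P(z_i = j) proportional to p_j N(x_i; mu_j, sigma2)
        z = []
        for xi in x:
            logw = [math.log(p[j]) - 0.5 * (xi - mu[j]) ** 2 / sigma2 for j in range(J)]
            top = max(logw)        # subtract the maximum to avoid underflow
            w = [math.exp(lw - top) for lw in logw]
            z.append(rng.choices(range(J), weights=w)[0])
        groups = [[xi for xi, zi in zip(x, z) if zi == j] for j in range(J)]
        nj = [len(g) for g in groups]
        xbar = [sum(g) / len(g) if g else 0.0 for g in groups]
        ss = [sum((xi - xbar[j]) ** 2 for xi in groups[j]) for j in range(J)]
        # weights: Dirichlet(1 + n_1, ..., 1 + n_J) via normalised Gamma draws
        gam = [rng.gammavariate(1.0 + nj[j], 1.0) for j in range(J)]
        p = [gj / sum(gam) for gj in gam]
        # inverse variance (means integrated out); gammavariate takes a scale
        rate = 0.5 * (1.0 + sum(0.1 * nj[j] * xbar[j] ** 2 / (nj[j] + 0.1) + ss[j]
                                for j in range(J)))
        sigma2 = 1.0 / rng.gammavariate((n + 2) / 2.0, 1.0 / rate)
        # means given sigma
        mu = [rng.gauss(nj[j] * xbar[j] / (nj[j] + 0.1),
                        math.sqrt(sigma2 / (nj[j] + 0.1))) for j in range(J)]
        chain.append((list(p), list(mu), sigma2))
    return chain

rng = random.Random(1)
# Hypothetical data: two well-separated normal clusters
x = [rng.gauss(0, 1) for _ in range(60)] + [rng.gauss(6, 1) for _ in range(40)]
chain = gibbs_normal_mixture(x, 2, 200, rng)
p_last, mu_last, s2_last = chain[-1]
```

Plotting the successive values of each mean in `chain` is a quick way to see whether the labels ever switch during the run.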


Note that [22] (among others) dispute the relevance of asking for proper 
mixing over the $J!$ modes, arguing that on the contrary the fact that the 
Gibbs sampler sticks to a single mode allows for an easier inference. We 
obviously disagree with this perspective: first, from an algorithmic point of 
view, given the unconstrained posterior distribution as the target, a sampler 
that fails to explore all modes clearly fails to converge. Second, the idea 
that being restricted to a single mode provides a proper representation 
of the posterior is naively based on an intuition derived from mixtures 
with few components. As the number of components increases, modes on 
the posterior surface get inextricably mixed and a standard MCMC chain 
cannot be guaranteed to remain within a single modal region. Furthermore, 
it is impossible to check in practice whether or not this is the case. 

In his defence of “simple” MCMC strategies supplemented with post- 
processing steps, [22] states that 


Celeux et al.'s ([7]) argument is persuasive only to the extent that 
there are mixing problems beyond those arising from permutation 
invariance of the posterior distribution. [7] does not make this argument, 
indeed stating “The main defect of the Gibbs sampler from our 
perspective is the ultimate attraction of the local modes” (p. 959). That 
article produces no evidence of additional mixing problems in its examples, 
and we are not aware of such examples in the related literature. Indeed, 
the simplicity of the posterior distributions conditional on state 
assignments in most mixture models leads one to expect no irregularities of 
this kind. 

Fig. 8.5. From the left to the right, histograms of the parameters $(p_1, p_2, p_3, \mu_1, \mu_2, 
\mu_3, \sigma)$ of a normal mixture with $J = 3$ components based on $10^4$ iterations of the Gibbs 
sampler and the galaxy dataset, evolution of the $\sigma$ and of the log-likelihood. 


There are however clear irregularities in the convergence behaviour of Gibbs 
and Metropolis-Hastings algorithms as exhibited in [31] and [32] (Figure 
6.4) for an identifiable two-component normal mixture with both means 
unknown. In examples such as those, there exist secondary modes that 
may have much lower posterior values than the modes of interest but that 
are nonetheless too attractive for the Gibbs sampler to visit other modes. 
In such cases, the posterior inference derived from the MCMC output is 
plainly incoherent. (See also [24] for another illustration of a multimodal 
posterior distribution in an identifiable mixture setting.) 

However, as shown by the example below, for identifiable mixture models, 
there is no label switching to expect and the Gibbs sampler may work 
quite well. While there is no foolproof approach to check MCMC 
convergence ([42]), we recommend using the visited likelihoods to detect lack of 
mixing in the algorithms. This does not detect the label switching 
difficulties (but individual histograms do) but rather the possible trapping in 
a secondary mode or simply the slow exploration of the posterior surface. 
This is particularly helpful when implementing multiple runs in parallel. 

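Example 8.9. (Example 8.2 continued) Consider the case of a mixture 
of Student's t distributions with known and different numbers of degrees 
of freedom 

$$\sum_{j=1}^{J} p_j\, t_{\nu_j}(\mu_j, \sigma_j^2).$$

This mixture model is not a particular case of model (8.1) and is identifiable. 
Moreover, since the noncentral t distribution $t_\nu(\mu,\sigma^2)$ can be interpreted as 
a continuous mixture of normal distributions with a common mean and 
with variances distributed as scaled inverse $\chi^2$ random variables, a Gibbs 
sampler can be easily implemented in this setting by taking advantage of 
the corresponding latent variables: $x_i \sim t_\nu(\mu,\sigma^2)$ is the marginal of 

$$x_i \mid V_i \sim \mathcal{N}\left(\mu, \frac{\sigma^2 \nu}{V_i}\right), \qquad V_i \sim \chi^2_\nu.$$

Once these latent variables are included in the simulation, the conditional 
posterior distributions of all parameters are available when using 
conjugate priors like 

$$p \sim \mathcal{D}(1,\ldots,1), \qquad \mu_j \sim \mathcal{N}(\mu_0, 2\sigma_0^2), \qquad \sigma_j^2 \sim \mathcal{IG}(\alpha_\sigma, \beta_\sigma).$$

The full conditionals for the Gibbs sampler are a Dirichlet $\mathcal{D}(1+n_1,\ldots,1+n_J)$ 
distribution on the weight vector, inverse Gamma 

$$\mathcal{IG}\left( \alpha_\sigma + \frac{n_j}{2},\; \beta_\sigma + \sum_{z_i=j} \frac{(x_i - \mu_j)^2 V_i}{2\nu_j} \right)$$

distributions on the variances $\sigma_j^2$, normal 

$$\mathcal{N}\left( \frac{\mu_0 \sigma_j^2 \nu_j + 2\sigma_0^2 \sum_{z_i=j} x_i V_i}{\sigma_j^2 \nu_j + 2\sigma_0^2 \sum_{z_i=j} V_i},\; \frac{2\sigma_0^2 \sigma_j^2 \nu_j}{\sigma_j^2 \nu_j + 2\sigma_0^2 \sum_{z_i=j} V_i} \right)$$

distributions on the means $\mu_j$, and Gamma 

$$\mathcal{G}\left( \frac{1}{2} + \frac{\nu_j}{2},\; \frac{(x_i - \mu_j)^2}{2\sigma_j^2 \nu_j} + \frac{1}{2} \right)$$

distributions on the $V_i$. 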
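
The latent variable representation of the t distribution is easy to check by simulation: drawing $V \sim \chi^2_\nu$ and then $x \mid V \sim \mathcal{N}(\mu, \sigma^2\nu/V)$ should give draws whose variance is close to $\sigma^2\nu/(\nu-2)$. The sketch below also includes the Gamma full conditional of a single $V_i$, assuming the shape $1/2 + \nu_j/2$ and rate $(x_i-\mu_j)^2/(2\sigma_j^2\nu_j) + 1/2$ that follow from this representation. Function names and numerical values are illustrative assumptions.

```python
import math
import random

def draw_t_via_latent(mu, sigma2, nu, rng):
    """Draw x ~ t_nu(mu, sigma2) through V ~ chi^2_nu and x | V ~ N(mu, sigma2 * nu / V)."""
    v = rng.gammavariate(nu / 2.0, 2.0)      # chi^2_nu is Gamma(shape nu/2, scale 2)
    return rng.gauss(mu, math.sqrt(sigma2 * nu / v)), v

def draw_latent_v(x_i, mu_j, sigma2_j, nu_j, rng):
    """Full conditional of V_i: Gamma with shape 1/2 + nu_j/2 and
    rate (x_i - mu_j)^2 / (2 sigma2_j nu_j) + 1/2."""
    rate = (x_i - mu_j) ** 2 / (2.0 * sigma2_j * nu_j) + 0.5
    return rng.gammavariate(0.5 + nu_j / 2.0, 1.0 / rate)   # gammavariate uses a scale

rng = random.Random(2)
nu = 11
xs = [draw_t_via_latent(0.0, 1.0, nu, rng)[0] for _ in range(100000)]
sample_var = sum(xi ** 2 for xi in xs) / len(xs)   # the mean is zero by construction
v_new = draw_latent_v(xs[0], 0.0, 1.0, nu, rng)
```

With $\nu = 11$ the empirical variance should be close to $11/9$, in agreement with the variance of a $t_{11}(0,1)$ distribution.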

In order to illustrate the performance of the algorithm, we simulated 
2,000 observations from the two-component t mixture with $\mu_1 = 0$, $\mu_2 = 5$, 
$\sigma_1^2 = \sigma_2^2 = 1$, $\nu_1 = 5$, $\nu_2 = 11$ and $p_1 = 0.3$. The output of the Gibbs 
sampler is summarized in Figure 8.6. The mixing behaviour of the Gibbs 
chains seems to be excellent, as they explore neighbourhoods of the true 
values. ◁ 

Fig. 8.6. Histograms of the parameters, $\mu_1, \sigma_1, p_1, \mu_2, \sigma_2$, and evolution of the 
(observed) log-likelihood along 30,000 iterations of the Gibbs sampler and a sample 
of 2,000 observations. 


The example below shows that, for specific models and a small number 
of components, the Gibbs sampler may recover the symmetry of the target 
distribution. 


Example 8.10. (Example 8.6 continued) For the latent class model, 
if we use all four variables with two modalities each in [50], the Gibbs 
sampler involves two steps: the completion of the data with the component 
labels, and the simulation of the probabilities $p$ and $q_{tj}$ from Beta
$B(s_{tj} + .5, n_j - s_{tj} + .5)$ conditional distributions. For the 216 observations, the
Gibbs sampler seems to converge satisfactorily since the output in Figure
8.7 exhibits the perfect symmetry predicted by the theory. We can note that,
in this special case, the modes are well separated, and hence values can be
crudely estimated for $q_{tj}$ by a simple graphical identification of the modes.
< 

Fig. 8.7. Latent class model: histograms of $p$ and of the $q_{tj}$’s for $10^4$ iterations of the
Gibbs sampler and the four variables of [50]. The first histogram corresponds to $p$, the
next on the right to $q_{11}$, followed by $q_{21}$ (identical), then $q_{21}$, $q_{22}$, and so on.


8.4.3. Metropolis-Hastings approximations


The Gibbs sampler may fail to escape the attraction of a local mode, even
in a well-behaved case as in Example 1, where the likelihood and the
posterior distributions are bounded and the parameters are identifiable.
Part of the difficulty is due to the completion scheme, which increases the
dimension of the simulation space and considerably reduces the mobility
of the parameter chain. A standard alternative that requires neither
completion nor an increase in the dimension is the Metropolis-Hastings
algorithm. In fact, the likelihood of mixture models is available in closed
form, being computable in $O(Jn)$ time, and the posterior distribution is
thus available up to a multiplicative constant.
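
To make the $O(Jn)$ claim concrete, here is a minimal sketch of the log-likelihood of a normal mixture, evaluated with one pass over an $n \times J$ array. The normal component family and the function name are assumptions made for illustration; any component density available in closed form would work the same way.

```python
import numpy as np

def normal_mixture_loglik(x, p, mu, sigma):
    """Log-likelihood of a J-component normal mixture, in O(J n) operations."""
    x = np.asarray(x, dtype=float)[:, None]           # shape (n, 1)
    p, mu, sigma = (np.asarray(a, dtype=float) for a in (p, mu, sigma))
    # log p_j + log N(x_i | mu_j, sigma_j^2), one column per component
    log_terms = (np.log(p) - 0.5 * np.log(2.0 * np.pi) - np.log(sigma)
                 - 0.5 * ((x - mu) / sigma) ** 2)      # shape (n, J)
    # log-sum-exp over the components, for numerical stability
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))))

x = np.array([-1.0, 0.0, 2.5])
ll_single = normal_mixture_loglik(x, [1.0], [0.0], [1.0])
# Splitting one component into two identical ones must not change the value.
ll_split = normal_mixture_loglik(x, [0.3, 0.7], [0.0, 0.0], [1.0, 1.0])
```

Adding the log-prior to this quantity gives the log-posterior up to an additive constant, which is all a Metropolis-Hastings sampler needs.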


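General Metropolis-Hastings algorithm for mixture models


0. Initialization. Choose $p^{(0)}$ and $\theta^{(0)}$
1. Step t. For $t = 1, \ldots$

   1.1 Generate $(\tilde\theta, \tilde p)$ from $q(\theta, p \mid \theta^{(t-1)}, p^{(t-1)})$,
   1.2 Compute

   $$r = \frac{f(x \mid \tilde\theta, \tilde p)\,\pi(\tilde\theta, \tilde p)\,q(\theta^{(t-1)}, p^{(t-1)} \mid \tilde\theta, \tilde p)}{f(x \mid \theta^{(t-1)}, p^{(t-1)})\,\pi(\theta^{(t-1)}, p^{(t-1)})\,q(\tilde\theta, \tilde p \mid \theta^{(t-1)}, p^{(t-1)})},$$

   1.3 Generate $u \sim U_{[0,1]}$.
       If $r > u$ then $(\theta^{(t)}, p^{(t)}) = (\tilde\theta, \tilde p)$,
       else $(\theta^{(t)}, p^{(t)}) = (\theta^{(t-1)}, p^{(t-1)})$.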
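
The steps above can be sketched as a short generic routine. This is a minimal sketch under stated assumptions: the state is whatever object the user passes in, the callables `log_target`, `propose` and `log_q` are hypothetical names supplied by the user, and everything is computed on the log scale. The toy target at the end is a standard normal, used only to check that the routine runs.

```python
import numpy as np

def metropolis_hastings(log_target, propose, log_q, init, n_iter, rng):
    """Generic Metropolis-Hastings sampler.

    log_target(s): log of f(x | s) * pi(s), up to an additive constant.
    propose(s, rng): draws a candidate from q(. | s).
    log_q(new, old): log q(new | old).
    """
    state = init
    chain = [state]
    for _ in range(n_iter):
        cand = propose(state, rng)
        log_r = (log_target(cand) + log_q(state, cand)
                 - log_target(state) - log_q(cand, state))
        # accept when r > u, i.e. log r > log u, with u uniform on (0, 1]
        if log_r > np.log(1.0 - rng.random()):
            state = cand
        chain.append(state)
    return np.array(chain)

# Toy check on a standard normal target with a symmetric random walk proposal.
rng = np.random.default_rng(1)
chain = metropolis_hastings(
    log_target=lambda s: -0.5 * s ** 2,
    propose=lambda s, rng: s + rng.normal(scale=2.0),
    log_q=lambda new, old: 0.0,   # symmetric proposal: the q terms cancel
    init=0.0, n_iter=20_000, rng=rng)
```

For a mixture, `log_target` would be the mixture log-likelihood plus the log-prior evaluated at $(\theta, p)$.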


The major difference with the Gibbs sampler is that we need to choose
the proposal distribution $q$, which can be a priori anything, and this is a
mixed blessing! The most generic proposal is the random walk Metropolis-
Hastings algorithm where each unconstrained parameter is the mean of the
proposal distribution for the new value, that is,

$$\tilde\theta_j = \theta_j^{(t-1)} + u_j$$

where $u_j \sim N(0,\zeta^2)$. However, for constrained parameters like the weights
and the variances in a normal mixture model, this proposal is not efficient.

This is indeed the case for the parameter $p$, due to the constraint that
$\sum_{j=1}^{J} p_j = 1$. To solve this difficulty, [5] propose to overparameterise the
model (8.1) as

$$p_j = w_j \Big/ \sum_{l=1}^{J} w_l, \qquad w_j > 0,$$

thus removing the simulation constraint on the $p_j$’s. Obviously, the $w_j$’s
are not identifiable, but this is not a difficulty from a simulation point
of view and the $p_j$’s remain identifiable (up to a permutation of indices).
Perhaps paradoxically, using overparameterised representations often helps
with the mixing of the corresponding MCMC algorithms since they are less
constrained by the dataset or the likelihood. The proposed move on the
$w_j$’s is $\log(\tilde w_j) = \log(w_j^{(t-1)}) + u_j$ where $u_j \sim N(0,\zeta^2)$.
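
A minimal sketch of this move is given below. It draws the candidate on the log scale, maps it back to normalised weights, and returns the log of the proposal ratio $q(w \mid \tilde w)/q(\tilde w \mid w)$, which for a log-normal random walk equals $\sum_j \log \tilde w_j - \sum_j \log w_j$. The function name is an illustrative choice, and a prior on the $w_j$’s, which the acceptance ratio also needs, is left out here.

```python
import numpy as np

def propose_weights(w, zeta, rng):
    """Random walk on log(w_j); returns the candidate, its weights p, and the log q-ratio."""
    w = np.asarray(w, dtype=float)
    w_new = np.exp(np.log(w) + rng.normal(scale=zeta, size=w.shape))
    p_new = w_new / w_new.sum()             # p_j = w_j / sum_l w_l
    # log q(w | w_new) - log q(w_new | w) for a log-normal random walk
    log_q_ratio = np.sum(np.log(w_new)) - np.sum(np.log(w))
    return w_new, p_new, log_q_ratio

rng = np.random.default_rng(2)
w_new, p_new, log_q_ratio = propose_weights([1.0, 2.0, 0.5], zeta=0.3, rng=rng)
```

The candidate weights are positive and sum to one by construction, so no proposal is wasted on values outside the simplex.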


Example 8.11. (Example 8.2 continued) We now consider the more 
realistic case when the degrees of freedom of the t distributions are unknown. 
The Gibbs sampler cannot be implemented as such given that the distribution
of the $\nu_j$’s is far from standard. A common alternative ([42]) is to
introduce a Metropolis step within the Gibbs sampler to overcome this
difficulty. If we use the same Gamma prior distribution with hyperparameters
$(\alpha_\nu, \beta_\nu)$ for all the $\nu_j$’s, the density of the full conditional distribution of $\nu_j$
is proportional to

$$\left(\frac{(1/2)^{\nu_j/2}}{\Gamma(\nu_j/2)}\right)^{n_j} \nu_j^{\alpha_\nu - 1 - n_j/2} \exp(-\beta_\nu \nu_j) \prod_{i:\, z_i=j} V_i^{\nu_j/2} \exp\left(-\frac{(x_i-\mu_j)^2 V_i}{2\sigma_j^2 \nu_j}\right).$$

Therefore, we resort to a random walk proposal on the $\log(\nu_j)$’s with scale
$\zeta = 5$. (The hyperparameters are $\alpha_\nu = 5$ and $\beta_\nu = 2$.)
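
A Metropolis step of this kind might look as follows. This is a sketch under stated assumptions: it takes the latent representation $x_i \mid V_i \sim N(\mu_j, \sigma_j^2\nu_j/V_i)$, $V_i \sim \chi^2_{\nu_j}$ and a Gamma$(\alpha_\nu, \beta_\nu)$ prior, from which the log full conditional coded below is derived; the function names are hypothetical, and the random walk scale used in the toy run (0.5) is chosen for illustration rather than taken from the text.

```python
import math
import numpy as np

def log_cond_nu(nu, x, v, mu, sigma2, alpha_nu=5.0, beta_nu=2.0):
    """Log full conditional of nu_j (up to a constant), given the latent V_i.

    x and v hold the observations allocated to component j and their scales.
    """
    n = len(x)
    return (n * (-0.5 * nu * math.log(2.0) - math.lgamma(0.5 * nu))
            + (alpha_nu - 1.0 - 0.5 * n) * math.log(nu) - beta_nu * nu
            + 0.5 * nu * float(np.sum(np.log(v)))
            - float(np.sum(v * (x - mu) ** 2)) / (2.0 * sigma2 * nu))

def update_nu(nu, x, v, mu, sigma2, zeta, rng):
    """One random walk Metropolis step on log(nu)."""
    cand = nu * math.exp(zeta * rng.normal())
    # log(cand) - log(nu) is the Jacobian term of the log-scale proposal
    log_r = (log_cond_nu(cand, x, v, mu, sigma2) + math.log(cand)
             - log_cond_nu(nu, x, v, mu, sigma2) - math.log(nu))
    return cand if math.log(1.0 - rng.random()) < log_r else nu

rng = np.random.default_rng(3)
v = rng.chisquare(5.0, size=200)
x = rng.normal(size=200) * np.sqrt(5.0 / v)   # t_5 draws via the latent scales
nu, trace = 5.0, []
for _ in range(500):
    nu = update_nu(nu, x, v, mu=0.0, sigma2=1.0, zeta=0.5, rng=rng)
    trace.append(nu)
```

In the actual sampler the $V_i$’s, allocations and other parameters would be refreshed between successive updates of $\nu_j$; they are held fixed here only to keep the example short.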

In order to illustrate the performance of the algorithm, two cases are
considered: (i) all parameters except variances ($\sigma_1^2 = \sigma_2^2 = 1$) are unknown
and (ii) all parameters are unknown. For a simulated dataset, the results
are given in Figure 8.8 and Figure 8.9, respectively. In both cases, the
posterior distributions of the $\nu_j$’s exhibit very large variances, which
indicates that the data is very weakly informative about the degrees of freedom.
The Gibbs sampler does not mix well enough to recover the symmetry in
the marginal approximations. The estimated densities for both cases are
compared in Figure 8.10, together with the setting where the degrees of
freedom are known. The estimated
mixture densities are indistinguishable and the fit to the simulated dataset
is quite adequate. Clearly, the corresponding Gibbs samplers have recovered
correctly one and only one of the two symmetric modes.

Fig. 8.8. Histograms of the parameters $(\mu_1, \nu_1, p_1, \mu_2, \nu_2)$ when the variance parameters
are known, and evolution of the log-likelihood for a simulated t mixture with 2,000
points, based on $3 \times 10^4$ MCMC iterations.


We now consider the aerosol particle dataset described in Example 8.2.
We use the same prior distributions on the $\nu_j$’s as before, that is $G(5, 2)$.
Figure 8.11 summarises the output of the MCMC algorithm. Since there
is no label switching and only two components, we choose to estimate the
parameters by the empirical averages, as illustrated in Table 8.2. As shown
by Figure 8.2, both t mixtures and normal mixtures fit the aerosol data
reasonably well. <

Fig. 8.9. Histograms of the parameters $\mu_1, \sigma_1, \nu_1, p_1, \mu_2, \sigma_2, \nu_2$, and evolution of the
log-likelihood for a simulated t mixture with 2,000 points, based on $3 \times 10^4$ MCMC
iterations.

Fig. 8.10. Histogram of the simulated dataset, compared with estimated t mixtures
with known $\sigma^2$ (red), known $\nu$ (green), and when all parameters are unknown (blue).

Fig. 8.11. Histograms of parameters $(\mu_1, \sigma_1, \nu_1, p_1, \mu_2, \sigma_2, \nu_2)$ and log-likelihood of a
mixture of t distributions based on 30,000 iterations and the aerosol data.


Table 8.2. Estimates of the parameters for the aerosol dataset compared for t
and normal mixtures.

          mu_1     mu_2     sigma_1  sigma_2  nu_1      nu_2      p_1
Student   2.5624   3.9918   0.5795   0.3595   18.5736   19.3001   0.3336
Normal    2.5729   3.9680   0.6004   0.3704   -         -         0.3391


8.5. Inference for Mixture Models with an Unknown 
Number of Components 


Estimation of J, the number of components in (8.1), is a special type of 
model choice problem, for which there are a number of possible solutions: 


(i) direct computation of the Bayes factors ([8, 25]); 
(ii) evaluation of an entropy distance ([34, 45]); 
(iii) generation from a joint distribution across models via reversible jump
MCMC ([40]) or via birth-and-death processes ([49]);
(iv) indirect inference through Dirichlet process mixture models.


We refer to [31] for a short description of the reversible jump MCMC
solution, a longer survey being available in [42] and a specific description
for mixtures (including an R package) being provided in [32]. The
alternative birth-and-death processes proposed in [49] have not generated as much
follow-up, except for [5] who showed that the essential mechanism in this
approach was the same as with reversible jump MCMC algorithms.

Dirichlet processes ([1, 18]) are often advanced as an alternative to the
estimation of the number of components for mixtures because they naturally
embed a clustering mechanism. A Dirichlet process is a nonparametric
object that formally involves a countably infinite number of components.
Nonetheless, inference on Dirichlet processes for a finite sample size
produces a random number of clusters. This can be used as an estimate of the
number of components. From a computational point of view, see [37] for
an MCMC solution and [3] for a variational alternative. We note that the
proposal of [33] mentioned earlier also involves a Dirichlet cluster process
modeling that leads to a posterior on the number of components.

We focus here on the first two approaches, because, first, the description
of reversible jump MCMC algorithms requires much care and therefore
more space than we can allow in this paper and, second, this description
exemplifies recent advances in the derivation of Bayes factors. These
solutions pertain more strongly to the testing perspective, the entropy distance
approach being based on the Kullback-Leibler divergence between a $J$
component mixture and its projection on the set of $J - 1$ mixtures, in the same
spirit as in [14]. Given that the calibration of the Kullback divergence is
open to various interpretations ([14, 23, 34]), we will only cover here some
proposals regarding approximations of the Bayes factor oriented towards
the direct exploitation of outputs from single model MCMC runs.

In fact, the major difference between approximations of Bayes factors 
based on those outputs and approximations based on the output from the 
reversible jump chains is that the latter requires a sufficiently efficient choice 
of proposals to move around models, which can be difficult despite signif- 
icant recent advances ([4]). If we can instead concentrate the simulation 
effort on single models, the complexity of the algorithm decreases (a lot) and 
there exist ways to evaluate the performance of the corresponding MCMC 
samples. In addition, it is often the case that few models are in competition 
when estimating J and it is therefore possible to visit the whole range of 
potential models in an exhaustive manner. 

We have

$$f_J(x \mid \lambda_J) = \prod_{i=1}^{n} \sum_{j=1}^{J} p_j f(x_i \mid \theta_j)$$

where $\lambda_J = (\theta, p) = (\theta_1, \ldots, \theta_J, p_1, \ldots, p_J)$. Most solutions [20] revolve
around an importance sampling approximation to the marginal likelihood
integral

$$m_J(x) = \int f_J(x \mid \lambda_J)\, \pi_J(\lambda_J)\, d\lambda_J$$


where J denotes the model index (that is the number of components in 
the present case). For instance, [28] use bridge sampling with simulated 
annealing scenarios to overcome the label switching problem. [47] rely on 
defensive sampling and the use of conjugate priors to reduce the integration 
to the space of latent variables (as in [6]) with an iterative construction of 
the importance function. [19] also centers her approximation of the marginal 
likelihood on a bridge sampling strategy, with particular attention paid to 
identifiability constraints. A different possibility is to use the representation
in [21]: starting from an arbitrary density $g_J$, the equality

$$1 = \int g_J(\lambda_J)\, d\lambda_J = \int \frac{g_J(\lambda_J)}{f_J(x \mid \lambda_J)\, \pi_J(\lambda_J)}\, f_J(x \mid \lambda_J)\, \pi_J(\lambda_J)\, d\lambda_J = m_J(x) \int \frac{g_J(\lambda_J)}{f_J(x \mid \lambda_J)\, \pi_J(\lambda_J)}\, \pi_J(\lambda_J \mid x)\, d\lambda_J$$

implies that a potential estimate of $m_J(x)$ is

$$\hat m_J(x) = 1 \Big/ \frac{1}{T} \sum_{t=1}^{T} \frac{g_J(\lambda_J^{(t)})}{f_J(x \mid \lambda_J^{(t)})\, \pi_J(\lambda_J^{(t)})}$$

when the $\lambda_J^{(t)}$’s are produced by a Monte Carlo or an MCMC sampler
targeted at $\pi_J(\lambda_J \mid x)$. While this solution can be easily implemented in low
dimensional settings ([10]), calibrating the auxiliary density $g_J$ is always an
issue. The auxiliary density could be selected as a non-parametric estimate
of $\pi_J(\lambda_J \mid x)$ based on the sample itself but this is very costly. Another
difficulty is that the estimate may have an infinite variance and thus be too
variable to be trustworthy, as experienced by [19].
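
The estimate above is easy to try on a model where the marginal likelihood is known in closed form. The sketch below is such a check and not a mixture implementation: it assumes the toy conjugate model $x_i \sim N(\theta, 1)$, $\theta \sim N(0,1)$, samples exactly from the posterior, and takes for $g$ a normal density centred at the posterior mean with half the posterior variance, so that $g$ has lighter tails than the posterior and the estimate has finite variance. All names are illustrative.

```python
import numpy as np

def log_marginal_from_g(log_lik_prior, log_g, draws):
    """log of 1 / mean_t[ g(l_t) / (f(x | l_t) pi(l_t)) ], with l_t drawn from pi(. | x)."""
    log_w = log_g(draws) - log_lik_prior(draws)
    m = log_w.max()
    return -(m + np.log(np.mean(np.exp(log_w - m))))

# Toy conjugate model (an assumption for checking): x_i ~ N(theta, 1), theta ~ N(0, 1).
rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, size=20)
n, s = len(x), x.sum()
post_mean, post_var = s / (n + 1), 1.0 / (n + 1)
draws = rng.normal(post_mean, np.sqrt(post_var), size=20_000)   # exact posterior draws

def log_lik_prior(theta):
    loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((x[:, None] - theta) ** 2).sum(axis=0)
    logprior = -0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2
    return loglik + logprior

def log_g(theta):
    # auxiliary density with lighter tails than the posterior: N(post_mean, post_var / 2)
    var = post_var / 2
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (theta - post_mean) ** 2 / var

estimate = log_marginal_from_g(log_lik_prior, log_g, draws)
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
         - 0.5 * (np.sum(x ** 2) - s ** 2 / (n + 1)))
```

Choosing $g$ with heavier tails than the posterior would make the ratio unbounded, which is one way the infinite variance mentioned above can arise.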

Yet another approximation to the integral $m_J(x)$ is to consider it as the
expectation of $f_J(x \mid \lambda_J)$, when $\lambda_J$ is distributed from the prior. While a
brute force approach simulating $\lambda_J$ from the prior distribution requires
a huge number of simulations ([36]), a Riemann based alternative is
proposed by [46] under the denomination of nested sampling; however, [10] have
shown in the case of mixtures that this technique could lead to uncertainties
about the quality of the approximation.

We consider here a further solution, first proposed by [8], that is
straightforward to implement in the setting of mixtures (see [9] for extensions).
Although it came under criticism by [36] (see also [19]), we show below
how the drawback pointed out by the latter can easily be removed. Chib’s ([8])
method is directly based on the expression of the marginal distribution
(loosely called marginal likelihood in this section) in Bayes’ theorem:

$$m_J(x) = \frac{f_J(x \mid \lambda_J)\, \pi_J(\lambda_J)}{\pi_J(\lambda_J \mid x)}$$

and on the property that the rhs of this equation is constant in $\lambda_J$.
Therefore, if an arbitrary value of $\lambda_J$, $\lambda_J^*$ say, is selected and if a good
approximation to $\pi_J(\lambda_J \mid x)$ can be constructed, $\hat\pi_J(\lambda_J \mid x)$, Chib’s ([8]) approximation
to the marginal likelihood is

$$\hat m_J(x) = \frac{f_J(x \mid \lambda_J^*)\, \pi_J(\lambda_J^*)}{\hat\pi_J(\lambda_J^* \mid x)}. \qquad (8.8)$$

In the case of mixtures, a natural approximation to $\pi_J(\lambda_J \mid x)$ is the
Rao-Blackwell estimate

$$\hat\pi_J(\lambda_J^* \mid x) = \frac{1}{T} \sum_{t=1}^{T} \pi_J(\lambda_J^* \mid x, z^{(t)}),$$

where the $z^{(t)}$’s are the latent variables simulated by the MCMC sampler.
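
The identity behind (8.8) can be checked numerically on a model where the posterior density is known exactly, so that no approximation of the posterior ordinate is needed. The sketch below assumes the toy conjugate model $x_i \sim N(\theta,1)$, $\theta \sim N(0,1)$, which is not a mixture; it only illustrates that the right-hand side does not depend on the chosen point $\theta^*$. Function names are illustrative.

```python
import numpy as np

def log_normal_pdf(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

def chib_log_marginal(x, theta_star):
    """log m(x) = log f(x | theta*) + log pi(theta*) - log pi(theta* | x).

    Toy model assumed here: x_i ~ N(theta, 1), theta ~ N(0, 1), for which the
    posterior is N(sum(x) / (n + 1), 1 / (n + 1)).
    """
    n, s = len(x), np.sum(x)
    log_lik = np.sum(log_normal_pdf(x, theta_star, 1.0))
    log_prior = log_normal_pdf(theta_star, 0.0, 1.0)
    log_post = log_normal_pdf(theta_star, s / (n + 1), 1.0 / (n + 1))
    return log_lik + log_prior - log_post

x = np.array([0.3, -1.2, 2.0, 0.7])
values = [chib_log_marginal(x, t) for t in (-1.0, 0.0, 0.5, 3.0)]
n, s = len(x), x.sum()
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
         - 0.5 * (np.sum(x ** 2) - s ** 2 / (n + 1)))
```

In the mixture setting the exact posterior ordinate is unavailable and is replaced by the Rao-Blackwell average over the simulated latent variables, which is where the choice of a high-density point matters for accuracy.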
To be efficient, this method requires 


(a) a good choice of $\lambda_J^*$ but, since in the case of mixtures, the likelihood
is computable, $\lambda_J^*$ can be chosen as the MCMC approximation to the
MAP estimator and,

(b) a good approximation to $\pi_J(\lambda_J \mid x)$.


This latter requirement is the core of Neal’s ([36]) criticism: while, at
a formal level, $\hat\pi_J(\lambda_J^* \mid x)$ is a converging (parametric) approximation to
$\pi_J(\lambda_J^* \mid x)$ by virtue of the ergodic theorem, this obviously requires the
chain $(z^{(t)})$ to converge to its stationary distribution. Unfortunately, as
discussed previously, in the case of mixtures, the Gibbs sampler rarely
converges because of the label switching phenomenon described in Section
8.3.1, so the approximation $\hat\pi_J(\lambda_J^* \mid x)$ is untrustworthy. [36] demonstrated
via a numerical experiment that (8.8) is significantly different from the true
value $m_J(x)$ when label switching does not occur. There is, however, a fix
to this problem, also explored by [2], which is to recover the label switching
symmetry a posteriori, replacing $\hat\pi_J(\lambda_J^* \mid x)$ in (8.8) above with

$$\tilde\pi_J(\lambda_J^* \mid x) = \frac{1}{T\, J!} \sum_{\sigma \in \mathfrak{S}_J} \sum_{t=1}^{T} \pi_J(\sigma(\lambda_J^*) \mid x, z^{(t)}),$$

where $\mathfrak{S}_J$ denotes the set of all permutations of $\{1, \ldots, J\}$ and $\sigma(\lambda_J^*)$
denotes the transform of $\lambda_J^*$ where components are switched according to the
permutation $\sigma$. Note that the permutation can equally be applied to $\lambda_J^*$
or to the $z^{(t)}$’s but that the former is usually more efficient from a
computational point of view given that the sufficient statistics only have to be
computed once. The justification for this modification either stems from a
Rao-Blackwellisation argument, namely that the permutations are ancillary
for the problem and should be integrated out, or follows from the general
framework of [26] where symmetries in the dominating measure should be
exploited towards the improvement of the variance of Monte Carlo
estimators.
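
The permutation average can be sketched in a few lines. The code below is a minimal sketch: `log_cond_post` is a hypothetical user-supplied function returning $\log \pi_J(\lambda \mid x, z)$, the parameter is stored with one row (or entry) per component so that a permutation is just a reindexing, and all $J!$ permutations are enumerated, which is only feasible for small $J$. The toy check mimics a chain that never leaves one mode.

```python
import itertools
import math
import numpy as np

def log_symmetrised_ordinate(log_cond_post, theta_star, z_draws):
    """log of (1 / (T J!)) sum_sigma sum_t pi(sigma(theta*) | x, z_t).

    log_cond_post(theta, z) is a user-supplied (hypothetical) function returning
    log pi(theta | x, z); theta_star has one entry (or row) per component.
    """
    theta_star = np.asarray(theta_star)
    J = theta_star.shape[0]
    terms = [log_cond_post(theta_star[list(perm)], z)
             for perm in itertools.permutations(range(J))
             for z in z_draws]
    terms = np.array(terms, dtype=float)
    m = terms.max()
    return m + math.log(np.sum(np.exp(terms - m))) - math.log(len(terms))

# Toy check: a chain stuck in one mode, so only the identity labelling has mass.
theta_star = np.array([0.0, 1.0, 2.0])
stuck = lambda theta, z: 0.0 if np.array_equal(theta, theta_star) else -np.inf
log_ord = log_symmetrised_ordinate(stuck, theta_star, z_draws=range(10))
```

In this stuck-chain toy case the symmetrised log-ordinate is lower than the unsymmetrised one by $\log(J!)$, so the resulting log marginal likelihood estimate is higher by the same amount.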


Example 8.12. (Example 8.8 continued) In the case of the normal
mixture and the galaxy dataset, using Gibbs sampling, label switching
does not occur. If we compute $\log \hat m_J(x)$ using only the original
estimate of [8], (8.8), the [logarithm of the] estimated marginal likelihood
is $\hat\rho_J(x) = -105.1396$ for $J = 3$ (based on $10^5$ simulations), while
introducing the permutations leads to $\hat\rho_J(x) = -103.3479$. As already noted
by [36], the difference between the original Chib’s ([8]) approximation and
the true marginal likelihood is close to $\log(J!)$ (only) when the Gibbs
sampler remains concentrated around a single mode of the posterior
distribution. In the current case, for $J = 2$, we have that $-116.3747 + \log(2!) = -115.6816$
exactly! (We also checked this numerical value against a brute-force
estimate obtained by simulating from the prior and averaging the likelihood,
up to fourth digit agreement.) A similar result holds for $J = 3$, with
$-105.1396 + \log(3!) = -103.3479$. Both [36] and [19] also pointed out that
the $\log(J!)$ difference was unlikely to hold for larger values of $J$ as the modes
became less separated on the posterior surface and thus the Gibbs sampler
was more likely to explore incompletely several modes. For $J = 4$, we get
for instance that the original Chib’s ([8]) approximation is $-104.1936$, while
the average over permutations gives $-102.6642$. Similarly, for $J = 5$, the
difference between $-103.91$ and $-101.93$ is less than $\log(5!)$. The $\log(J!)$
difference cannot therefore be used as a direct correction for Chib’s ([8])
approximation because of this difficulty in controlling the amount of overlap.
However, it is unnecessary since using the permutation average resolves
the difficulty. Table 8.3 shows that the preferred value of $J$ for the galaxy
dataset and the current choice of prior distribution is $J = 5$. <
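
The arithmetic in this example is quick to verify. The snippet below only reuses the values quoted above and compares the gap between the two estimates with $\log(J!)$ for each $J$; the variable names are illustrative.

```python
import math

# (J, original Chib estimate, permutation-averaged estimate), as quoted in Example 8.12
reported = [(2, -116.3747, -115.6816), (3, -105.1396, -103.3479),
            (4, -104.1936, -102.6642), (5, -103.91, -101.93)]

gaps = {J: sym - chib for J, chib, sym in reported}
log_fact = {J: math.log(math.factorial(J)) for J in gaps}

for J in gaps:
    print(f"J={J}: gap={gaps[J]:.4f}, log(J!)={log_fact[J]:.4f}")
```

For $J = 2$ and $J = 3$ the gap matches $\log(J!)$ to the reported precision, while for $J = 4$ and $J = 5$ it falls well short, in line with the discussion of overlapping modes.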


When the number of components $J$ grows too large for all permutations
in $\mathfrak{S}_J$ to be considered in the average, a (random) subsample of
permutations can be simulated to keep the computing time to a reasonable level
when keeping the identity as one of the permutations, as in Table 8.3 for
$J = 6, 7$. (See [2] for another solution.) Note also that the discrepancy
between the original Chib’s ([8]) approximation and the average over
permutations is a good indicator of the mixing properties of the Markov chain,
if a further convergence indicator is requested.

Table 8.3. Estimations of the marginal likelihoods by the symmetrised Chib’s
approximation (based on $10^5$ Gibbs iterations and, for $J > 5$, 100 permutations selected
at random in $\mathfrak{S}_J$).

J                 2         3         4         5         6         7         8
$\hat\rho_J(x)$   -115.68   -103.35   -102.66   -101.93   -102.88   -105.48   -108.44


Example 8.13. (Example 8.6 continued) For instance, in the setting
of Example 8.6 with $a = b = 1$, both the approximation of [8] and the
symmetrized one are identical. When comparing a single class model with
a two class model, the corresponding (log-)marginals are

$$\hat\rho_1(x) = \log \prod_{i=1}^{4} \frac{\Gamma(1)}{\Gamma(1/2)^2}\, \frac{\Gamma(n_i + 1/2)\, \Gamma(n - n_i + 1/2)}{\Gamma(n+1)} = -552.0402$$

and $\hat\rho_2(x) \approx -523.2978$, giving a clear preference to the two class model.
<


The paper is a review of results on the asymptotic behavior of Markov
processes generated by i.i.d. iterates of monotone maps. Of particular 
importance is the notion of splitting introduced by [12]. Some exten- 
sions to more general frameworks are outlined, and, finally, a number of 
applications are indicated. 


9.1. Introduction 


This paper is an impressionistic overview of some results on Markov pro- 
cesses that arise in the study of a particular class of random dynamical 
systems. A random dynamical system is described by a triplet $(S, \Gamma, Q)$
where $S$ is the state space (for example, a metric space), $\Gamma$ an appropriate
family of maps on $S$ into itself (interpreted as the set of all possible laws of
motion) and $Q$ is a probability measure on (some $\sigma$-field of) $\Gamma$.

The evolution of the system can be described as follows: initially, the
system is in some state $x$; an element $\alpha_1$ of $\Gamma$ is chosen randomly according
to the probability measure $Q$ and the system moves to a state $X_1 = \alpha_1(x)$
in period one. Again, independently of $\alpha_1$, an element $\alpha_2$ of $\Gamma$ is chosen
according to the probability measure $Q$ and the state of the system in period
two is obtained as $X_2 = \alpha_2(\alpha_1(x))$. In general, starting from some $x$ in $S$,
one has

$$X_{n+1}(x) = \alpha_{n+1}(X_n(x)), \qquad (1.1)$$

where the maps $(\alpha_n)$ are independent with the common distribution $Q$.
The initial point $x$ can also be chosen (independently of $(\alpha_n)$) as a random
variable $X_0$. The sequence $X_n$ of states obtained in this manner is a Markov
process and has been of particular interest in developing stochastic dynamic
models in many disciplines. With specific assumptions on the structure of
$S$ and $\Gamma$ it has been possible to derive strong results on the asymptotic
behavior of $X_n$.
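
The recursion (1.1) is straightforward to simulate when $\Gamma$ is finite. The sketch below is an illustration under stated assumptions: the two affine maps on $[0,1]$ and their probabilities are hypothetical choices, not taken from the text, and the function name is illustrative.

```python
import numpy as np

def iterate_random_maps(maps, probs, x0, n_steps, rng):
    """Simulate X_{n+1} = alpha_{n+1}(X_n), with alpha_n drawn i.i.d. from Q."""
    path = [x0]
    for idx in rng.choice(len(maps), size=n_steps, p=probs):
        path.append(maps[idx](path[-1]))
    return np.array(path)

# Hypothetical example: two increasing affine maps on S = [0, 1], each with probability 1/2.
maps = [lambda x: x / 2, lambda x: x / 2 + 0.5]
rng = np.random.default_rng(5)
path = iterate_random_maps(maps, [0.5, 0.5], x0=0.3, n_steps=10_000, rng=rng)
```

For this particular pair of maps the uniform distribution on $[0,1]$ is invariant, so long-run averages of the simulated path should settle near $1/2$.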

Random dynamical systems have been particularly useful for model- 
ing long run evolution of economic systems subject to exogenous random 
shocks. The framework (1.1) can be interpreted as a descriptive model; 
but, one may also start with a discounted (stochastic) dynamic program- 
ming problem, and directly arrive at a stationary optimal policy function, 
which together with the exogenously given law of transition describes the 
optimal evolution of the states in the form (1.1). Of particular significance 
are results on the “inverse optimal problem under uncertainty” due to [20] 
and [22] which assert that a very broad class of random systems (1.1) can 
be so interpreted. 

The literature exploring (1.1) is already vast and growing. Given the
space limitations, this review is primarily restricted to the case when $S$ is
an interval (non-degenerate) in $\mathbb{R}$, or a closed (nonempty) subset of $\mathbb{R}^\ell$, and
$\Gamma$ is a family of monotone maps from $S$ into $S$. Some extensions to more
general frameworks and applications are also outlined. Here I touch upon a
few of the issues and provide some references to definitive treatments.

(i) The existence, uniqueness and global stability of a steady state (an
invariant distribution) of random dynamical systems: Significant progress
has been achieved when the laws of motion satisfy either some “splitting”
or “contraction” conditions (see, e.g., [12], [11], [6, 7] and the review in [9],
Chapter 3). An awkward problem involving the existence question is worth
noting. Consider $S = [0,1]$ or $S = \mathbb{R}_+$ and assume that $\gamma(0) = 0$ for all
$\gamma \in \Gamma$. This is a natural property of a law of motion in many population
or economic models (viewed as a production function, $\gamma(0) = 0$ means that
zero input leads to zero output). The point mass at 0 (the measure $\delta_0$)
is obviously an invariant distribution. The challenge, then, is to find an
invariant distribution with support in $(0, 1)$.

(ii) The nature of the invariant distribution. Suppose, for concreteness,
that $S$ is an interval, and $F$ is the distribution function on $\mathbb{R}$ of the unique
invariant measure. Invoking a standard decomposition property (see [18],
pp. 130, 196), let (i) $F_d$ be the step part (a step function); (ii) $F_{ac}$ be the
absolutely continuous part (with respect to the Lebesgue measure) and
(iii) $F_s$ be the singular part of $F$.
As a first step one would like to know whether (i) $F$ is continuous
($F_d = 0$) or whether (ii) $F$ is absolutely continuous or whether (iii) $F$ is
singular. At the next step, one would like to ask questions of comparative
statics: how does $F$ (or the components (i)-(iii)) change if a parameter in
the model is allowed to change? Finally, one would like to compute (or
approximate) $F$ but that typically requires more structure on the model.

All the questions are elusive. Take the standard approach of describing a
Markov process with state space $S = \mathbb{R}$, and a transition function $p(x, A)$. If
for each $x \in S$, $p(x, \cdot)$ is absolutely continuous with respect to the Lebesgue
measure, then if $\pi$ is invariant under $p(x, A)$, $\pi$ is also absolutely continuous
with respect to the Lebesgue measure [see [9], Proposition 5.2 of Chapter
5]. This result is to be contrasted with those in Section 9.2.

A study of (i.i.d.) random iteration of quadratic maps ($S = [0,1]$, $\Gamma =
\{f : f(x) = \theta x(1 - x),\ 0 \le \theta \le 4\}$, $Q$ with a two point support) was
initiated by [5]. The subsequent literature offers interesting examples on
applications of splitting and open questions. For a review of results when
$\Gamma$ is the quadratic family (the typical $\gamma(x) = \theta x(1 - x)$ does not satisfy
the monotonicity property that is central here but does have ‘piecewise
monotonicity’ which has often been used to invoke the splitting conditions),
see [1]; further extensions are in [3].

The processes considered in this article, particularly when $\Gamma$ is finite, are
not in general Harris irreducible (see, e.g., [23] for a definition of Harris 
irreducibility). Therefore, the standard techniques used for the study of 
irreducible Markov processes in the literature are not applicable to many 
of the cases reviewed. This point was explored in detail in [13] who con- 
cluded that “it is surprising and unfortunate that the large classical theory 
based on compactness and/or irreducibility conditions generally give little 
information about (1.1) as a population model.” The reader interested in 
this issue is referred to [13], Section 5. 


(iii) Applications of the theoretical results to a few topics: 


(a) turnpike theorems in the literature on descriptive and optimal growth 
under uncertainty: when each admissible law of motion is monotone in- 
creasing, and satisfies the appropriate Inada-type ‘end point’ condition, 
Theorem 9.1 can be applied directly. 

(b) estimation of the invariant distribution: as noted above, an important
implication of the “splitting theorems” is an estimate of the speed of
convergence. This estimate is used in Section 9.5 to prove a result on
$\sqrt{n}$-consistency of the sample mean as an estimator of the expected
long run equilibrium value (i.e., the value of the state variable with
respect to the invariant distribution).


9.2. Random Dynamical Systems 


We consider random dynamical systems. Let $S$ be a metric space and $\mathcal{S}$
be the Borel $\sigma$-field of $S$. Endow $\Gamma$ with a $\sigma$-field $\Sigma$ such that the map
$(\gamma, x) \to \gamma(x)$ on $(\Gamma \times S, \Sigma \otimes \mathcal{S})$ into $(S, \mathcal{S})$ is measurable. Let $Q$ be a
probability measure on $(\Gamma, \Sigma)$. On some probability space $(\Omega, \mathcal{F}, P)$ let
$(\alpha_n)_{n=1}^{\infty}$ be a sequence of independent random functions from $\Gamma$ with a
common distribution $Q$. For a given random variable $X_0$ (with values in
$S$), independent of the sequence $(\alpha_n)_{n=1}^{\infty}$, define

$$X_1 \equiv \alpha_1(X_0) \equiv \alpha_1 X_0 \qquad (2.1)$$

$$X_{n+1} = \alpha_{n+1}(X_n) \equiv \alpha_{n+1}\alpha_n \cdots \alpha_1 X_0 \qquad (2.2)$$

We write $X_n(x)$ for the case $X_0 = x$; to simplify notation we write $X_n =
\alpha_n \cdots \alpha_1 X_0$ for the more general (random) $X_0$. Then $X_n$ is a Markov
process with the stationary transition probability $p(x, dy)$ given as follows:
for $x \in S$, $C \in \mathcal{S}$,

$$p(x, C) = Q(\{\gamma \in \Gamma : \gamma(x) \in C\}) \qquad (2.3)$$

The stationary transition probability $p(x, dy)$ is said to be weakly
continuous or to have the Feller property if for any sequence $x_n$ converging to $x$,
the sequence of probability measures $p(x_n, \cdot)$ converges weakly to $p(x, \cdot)$.
One can show that if $\Gamma$ consists of a family of continuous maps, $p(x, dy)$
has the Feller property.


9.3. Evolution 


To study the evolution of the process (2.2), it is convenient to define the
map $T^*$ [on the space $M(S)$ of all finite signed measures on $(S, \mathcal{S})$] by

$$T^*\mu(C) = \int_S p(x, C)\, \mu(dx) = \int_\Gamma \mu(\gamma^{-1}C)\, Q(d\gamma), \qquad \mu \in M(S). \qquad (3.1)$$

Let $\mathcal{P}(S)$ be the set of all probability measures on $(S, \mathcal{S})$. An element $\pi$
of $\mathcal{P}(S)$ is invariant for $p(x, dy)$ (or for the Markov process $X_n$) if it is a
fixed point of $T^*$, i.e.,

$$\pi \text{ is invariant iff } T^*\pi = \pi \qquad (3.2)$$
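Now write $p^{(n)}(x, dy)$ for the n-step transition probability with $p^{(1)} \equiv
p(x, dy)$. Then $p^{(n)}(x, dy)$ is the distribution of $\alpha_n \cdots \alpha_1 x$. Define $T^{*n}$
as the n-th iterate of $T^*$:

$$T^{*n}\mu = T^{*(n-1)}(T^*\mu) \ (n \ge 2), \qquad T^{*1} = T^*, \qquad T^{*0} = \text{Identity} \qquad (3.3)$$

Then for any $C \in \mathcal{S}$,

$$(T^{*n}\mu)(C) = \int_S p^{(n)}(x, C)\, \mu(dx), \qquad (3.4)$$

so that $T^{*n}\mu$ is the distribution of $X_n$ when $X_0$ has distribution $\mu$. To
express $T^{*n}$ in terms of the common distribution $Q$ of the i.i.d. maps $(\alpha_n)$,
let $\Gamma^n$ denote the usual Cartesian product $\Gamma \times \Gamma \times \cdots \times \Gamma$ (n terms), and
let $Q^n$ be the product probability $Q \times Q \times \cdots \times Q$ on $(\Gamma^n, \Sigma^{\otimes n})$ where
$\Sigma^{\otimes n}$ is the product $\sigma$-field on $\Gamma^n$. Thus $Q^n$ is the (joint) distribution
of $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$. For $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n) \in \Gamma^n$ let $\tilde\gamma$ denote the
composition

$$\tilde\gamma := \gamma_n \gamma_{n-1} \cdots \gamma_1 \qquad (3.5)$$

We suppress the dependence of $\tilde\gamma$ on $n$ for notational simplicity. Then,
since $T^{*n}\mu$ is the distribution of $X_n = \alpha_n \cdots \alpha_1 X_0$, one has $(T^{*n}\mu)(A) =
\mathrm{Prob}(X_0 \in \tilde\alpha^{-1}A)$, where $\tilde\alpha = \alpha_n \alpha_{n-1} \cdots \alpha_1$. Therefore, by the
independence of $\tilde\alpha$ and $X_0$,

$$(T^{*n}\mu)(A) = \int_{\Gamma^n} \mu(\tilde\gamma^{-1}A)\, Q^n(d\gamma) \qquad (A \in \mathcal{S},\ \mu \in \mathcal{P}(S)). \qquad (3.6)$$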


Finally, we come to the definition of stability. A Markov process $X_n$
is stable in distribution if there is a unique invariant probability measure
$\pi$ such that $X_n(x)$ converges weakly (or, in distribution) to $\pi$
irrespective of the initial state $x$, i.e., if $p^{(n)}(x, dy)$ converges weakly to the same
probability measure $\pi$ for all $x$.

In what follows, if $g$ is a bounded $\mathcal{S}$-measurable real valued function on
$S$, we write

$$Tg(x) = \int_S g(y)\, p(x, dy) \qquad (3.7)$$


9.4. Splitting


If $S$ is a (nonempty) compact metric space and $\Gamma$ consists of a family of
continuous functions from $S$ into $S$, then a fixed point argument ensures
that there is an invariant probability measure $\pi^*$. However, when $\Gamma$ consists
of monotone maps on a suitable subset $S$ of $\mathbb{R}^\ell$ (into $S$), stronger results
on uniqueness and stability can be derived by using a ‘splitting’ condition,
first studied by [12].


9.4.1. Splitting and Monotone Maps 


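Let $S$ be a nondegenerate interval (finite or infinite, closed, semiclosed, or
open) and $\Gamma$ a set of monotone maps from $S$ into $S$; i.e., each element of $\Gamma$
is either a nondecreasing function on $S$ or a nonincreasing function.

We assume the following splitting condition:
(H) There exist $z_0 \in S$, $\tilde\chi > 0$ and a positive integer $N$ such that

(1) $P(\alpha_N \alpha_{N-1} \cdots \alpha_1 x \le z_0 \ \forall x \in S) \ge \tilde\chi$,
(2) $P(\alpha_N \alpha_{N-1} \cdots \alpha_1 x \ge z_0 \ \forall x \in S) \ge \tilde\chi$.

Note that conditions (1) and (2) in (H) may be expressed, respectively,
as

$$Q^N(\{\gamma \in \Gamma^N : \tilde\gamma^{-1}[x \in S : x \le z_0] = S\}) \ge \tilde\chi, \qquad (4.1)$$

and

$$Q^N(\{\gamma \in \Gamma^N : \tilde\gamma^{-1}[x \in S : x \ge z_0] = S\}) \ge \tilde\chi. \qquad (4.2)$$

Recall that $\tilde\gamma = \gamma_N \gamma_{N-1} \cdots \gamma_1$.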
Denote by $d_K(\mu, \nu)$ the Kolmogorov distance on $\mathcal{P}(S)$. That is, if $F_\mu$, $F_\nu$
denote the distribution functions (d.f.) of $\mu$ and $\nu$, respectively, then

$$d_K(\mu, \nu) := \sup_{x \in \mathbb{R}} |F_\mu(x) - F_\nu(x)|, \qquad \mu, \nu \in \mathcal{P}(S). \qquad (4.3)$$


Remark 4.1. First, it should be noted that convergence in the distance
$d_K$ on $\mathcal{P}(S)$ implies weak convergence in $\mathcal{P}(S)$. Secondly, $(\mathcal{P}(S), d_K)$ is a
complete metric space. (See [9], Theorem 5.1 and C11.2(d) of Chapter 2.)

Theorem 9.1. Assume that the splitting condition (H) holds. Then

(a) the distribution $T^{*n}\mu$ of $X_n := \alpha_n \cdots \alpha_1 X_0$ converges to a probability
measure $\pi$ on $S$ in the Kolmogorov distance $d_K$ irrespective of $X_0$.
Indeed,

$$d_K(T^{*n}\mu, \pi) \le (1 - \tilde\chi)^{[n/N]} \qquad \forall \mu \in \mathcal{P}(S) \qquad (4.4)$$

where $[y]$ denotes the integer part of $y$.
(b) $\pi$ in (a) is the unique invariant probability of the Markov process $X_n$.


Proof. [Main Steps] Careful calculations using the splitting condition
and monotonicity lead to (see [9], Chapter 3, Theorem 5.1):

$$d_K(T^*\mu, T^*\nu) \le d_K(\mu, \nu) \qquad (4.5)$$

and

$$d_K(T^{*N}\mu, T^{*N}\nu) \le (1 - \tilde\chi)\, d_K(\mu, \nu) \qquad (\mu, \nu \in \mathcal{P}(S)). \qquad (4.6)$$

That is, $T^{*N}$ is a uniformly strict contraction and $T^*$ is a contraction.
As a consequence, $\forall n > N$, one has

$$\begin{aligned}
d_K(T^{*n}\mu, T^{*n}\nu) &= d_K\big(T^{*N}(T^{*(n-N)}\mu),\, T^{*N}(T^{*(n-N)}\nu)\big) \\
&\le (1 - \tilde\chi)\, d_K(T^{*(n-N)}\mu, T^{*(n-N)}\nu) \le \cdots \\
&\le (1 - \tilde\chi)^{[n/N]}\, d_K\big(T^{*(n-[n/N]N)}\mu,\, T^{*(n-[n/N]N)}\nu\big) \\
&\le (1 - \tilde\chi)^{[n/N]}\, d_K(\mu, \nu). \qquad (4.7)
\end{aligned}$$

Now, by appealing to the contraction mapping theorem, $T^{*N}$ has a unique
fixed point $\pi$ in $\mathcal{P}(S)$, and $T^{*N}(T^*\pi) = T^*(T^{*N}\pi) = T^*\pi$. Hence $T^*\pi$ is
also a fixed point of $T^{*N}$. By uniqueness $T^*\pi = \pi$. Hence, $\pi$ is a fixed
point of $T^*$. Any fixed point of $T^*$ is a fixed point of $T^{*N}$. Hence $\pi$ is
the unique fixed point of $T^*$. Now take $\nu = \pi$ in (4.7) to get the desired
relation (4.4), using $d_K(\mu, \pi) \le 1$. □
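
The bound (4.4) can be checked exactly on a small example. The sketch below is an illustration under stated assumptions, none of which come from the text: $S = [0,1]$, the two maps $x \mapsto x/2$ and $x \mapsto x/2 + 1/2$ are each chosen with probability $1/2$, so that the splitting condition holds with threshold $1/2$, $N = 1$ and constant $1/2$, and the invariant distribution is the uniform distribution on $[0,1]$. The law of $X_n(x)$ then puts equal mass on $2^n$ points, and its Kolmogorov distance to the uniform d.f. can be computed exactly.

```python
import numpy as np

def kolmogorov_to_uniform(points):
    """sup_t |F_n(t) - t| for the d.f. F_n putting equal mass on `points` in [0, 1]."""
    pts = np.sort(np.asarray(points, dtype=float))
    m = len(pts)
    upper = np.arange(1, m + 1) / m - pts      # F_n(t) - t at each atom
    lower = pts - np.arange(0, m) / m          # t - F_n(t-) just before each atom
    return max(upper.max(), lower.max())

def law_of_X_n(x, n):
    """Atoms of X_n(x) for the maps x/2 and x/2 + 1/2, each chosen with probability 1/2."""
    atoms = np.array([x])
    for _ in range(n):
        atoms = np.concatenate([atoms / 2, atoms / 2 + 0.5])
    return atoms                                # 2**n equally likely values

chi, N = 0.5, 1    # splitting at 1/2: x/2 <= 1/2 <= x/2 + 1/2 for all x in [0, 1]
distances = [kolmogorov_to_uniform(law_of_X_n(0.3, n)) for n in range(1, 11)]
bounds = [(1 - chi) ** (n // N) for n in range(1, 11)]
```

Each computed distance stays below the corresponding bound, and both shrink by a factor of two per step, so the geometric rate in (4.4) is attained up to a constant in this example.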


The following remarks clarify the role of the splitting condition. 


Remark 4.2. Let $S = [a,b]$ and $\alpha_n$ $(n \ge 1)$ a sequence of i.i.d. continuous
nondecreasing maps on $S$ into $S$. Suppose that $\pi$ is the unique invariant
distribution of the Markov process. If $\pi$ is not degenerate, then the splitting
condition holds [[12], Theorem 5.17; for relaxing continuity, see [9], Lemma
CS.2 of Chapter 3].

Remark 4.3. Suppose that $\alpha_n$ are strictly monotone a.s. Then if the
initial distribution $\mu$ is nonatomic (i.e., $\mu(\{x\}) = 0\ \forall x$ or, equivalently the
d.f. of $\mu$ is continuous), $\mu \circ \gamma^{-1}$ is nonatomic $\forall \gamma \in \Gamma$ (outside a set of zero
$Q$-probability). It follows that if $X_0$ has a continuous d.f., then so has $X_1$
and in turn $X_2$ has a continuous d.f., and so on. Since, by Theorem 9.1,
this sequence of continuous d.f.s (of $X_n$ $(n \ge 1)$) converges uniformly to the
d.f. of $\pi$, the latter is continuous. Thus $\pi$ is nonatomic if $\alpha_n$ are strictly
monotone a.s.


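Example 4.1. Let $S = [0,1]$ and $\Gamma$ be a family of monotone nondecreasing
functions from $S$ into $S$. As before, for any $x \in S$, let
\[ X_n(x) = \alpha_n \cdots \alpha_1 x. \]
One can verify the following two results:
[R.1] $P[X_n(0) \le x]$ is nonincreasing in $n$ and converges for each $x \in S$.

[R.2] $P[X_n(1) \le x]$ is nondecreasing in $n$ and converges for each $x \in S$.
Write
\[ F_0(x) = \lim_{n\to\infty} P[X_n(0) \le x], \qquad F_1(x) = \lim_{n\to\infty} P[X_n(1) \le x]. \]
Consider the case $\Gamma = \{f\}$, where
\[
f(x) = \begin{cases}
\frac14 + \frac{x}{4} & \text{if } 0 \le x < \frac13, \\
\frac13 + \frac{x}{3} & \text{if } \frac13 \le x \le \frac23, \\
\frac13 + \frac{x}{2} & \text{if } \frac23 < x \le 1.
\end{cases}
\]

Verify that $f$ is a monotone increasing map from $S$ into $S$, but $f$ is not
continuous. One can calculate that
\[
F_0(x) = \begin{cases} 0 & \text{if } 0 \le x < \frac13, \\ 1 & \text{if } \frac13 \le x \le 1, \end{cases}
\qquad
F_1(x) = \begin{cases} 0 & \text{if } 0 \le x < \frac23, \\ 1 & \text{if } \frac23 \le x \le 1. \end{cases}
\]

Neither $F_0$ nor $F_1$ is a stationary distribution function.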
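
The claims of Example 4.1 can be checked by iterating the map in exact rational arithmetic. The sketch below assumes the piecewise form $f(x) = 1/4 + x/4$ on $[0, 1/3)$, $f(x) = 1/3 + x/3$ on $[1/3, 2/3]$ and $f(x) = 1/3 + x/2$ on $(2/3, 1]$; the function names are illustrative only.

```python
from fractions import Fraction

ONE_THIRD = Fraction(1, 3)
TWO_THIRDS = Fraction(2, 3)


def f(x):
    """Monotone, discontinuous map on [0, 1] assumed for Example 4.1."""
    if x < ONE_THIRD:
        return Fraction(1, 4) + x / 4
    if x <= TWO_THIRDS:
        return ONE_THIRD + x / 3
    return ONE_THIRD + x / 2


def iterate(x, n):
    """Return the n-th iterate f^(n)(x), computed exactly."""
    for _ in range(n):
        x = f(x)
    return x


if __name__ == "__main__":
    # Iterates from 0 increase towards 1/3; iterates from 1 decrease towards 2/3.
    for n in (1, 5, 20):
        print(n, float(iterate(Fraction(0), n)), float(iterate(Fraction(1), n)))
    # Neither limit point is fixed by f, so neither point mass is invariant.
    print(f(ONE_THIRD), f(TWO_THIRDS))
```

Under this assumed form, the iterates started at 0 stay strictly below 1/3 and those started at 1 stay strictly above 2/3, while $f(1/3) = 4/9$ and $f(2/3) = 5/9$. This is why the limiting point masses at 1/3 and 2/3 fail to be stationary.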


Example 4.2. Let $S = [0,1]$ and $\Gamma = \{f_1, f_2\}$. In each period $f_i$ is
chosen with probability $\frac12$. $f_1$ is the function $f$ defined in Example 4.1, and
\[ f_2(x) = \frac13 + \frac{x}{3}, \quad \text{for } x \in S. \]
Then
\[
F_0(x) = F_1(x) = \begin{cases} 0 & \text{if } 0 \le x < \frac12, \\ 1 & \text{if } \frac12 \le x \le 1, \end{cases}
\]
and $F_0(x)$ is the unique stationary distribution. Note that $f_1(\frac12) = f_2(\frac12) =
\frac12$, i.e., $f_1$ and $f_2$ have a common fixed point. Examples 4.1 and 4.2 are
taken from [30].


We now turn to the case where the state space is a subset of $\mathbb{R}^\ell$ $(\ell \ge 1)$
satisfying the following assumption:
(A.1) $S$ is a closed subset of $\mathbb{R}^\ell$.

Let $\Gamma$ be a set of monotone maps $\gamma$ on $S$ into $S$, under the partial order:
$x \le y$ if $x_j \le y_j$ for $1 \le j \le \ell$; $x = (x_1, \ldots, x_\ell)$, $y = (y_1, y_2, \ldots, y_\ell) \in \mathbb{R}^\ell$ (or
$S$). That is, either $\gamma$ is monotone increasing: $\gamma(x) \le \gamma(y)$ if $x \le y$, or $\gamma$
is monotone decreasing: $\gamma(y) \le \gamma(x)$ if $x \le y$; $x, y \in S$.

On the space $\mathcal{P}(S)$, define, for each $a > 0$, the metric
\[ d_a(\mu, \nu) = \sup_{g \in G_a} \left| \int g\,d\mu - \int g\,d\nu \right| \qquad (\mu, \nu \in \mathcal{P}(S)), \tag{4.8} \]
where $G_a$ is the class of all Borel measurable monotone (increasing or de-
creasing) functions $g$ on $S$ into $[0,a]$. The following result is due to [10],
who derived a number of interesting results on the metric space $(\mathcal{P}(S), d_a)$.

One can show that convergence in the metric $d_a$ implies weak convergence
if (A.1) holds (see [9], pp. 287-288).


Lemma 9.1. Under the hypothesis (A.1), $(\mathcal{P}(S), d_a)$ is a complete metric
space.


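Consider the following splitting condition (H'). To state it, let $\tilde\gamma$ be as
in (3.5), but with $n = N$: $\tilde\gamma = \gamma_N \gamma_{N-1} \cdots \gamma_1$ for $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_N) \in \Gamma^N$.
(H') There exist $F_i \in \Sigma^{\otimes N}$ $(i = 1,2)$ for some $N \ge 1$, such that

(i) $\delta_i \equiv Q^N(F_i) > 0$ $(i = 1,2)$, and

(ii) for some $x_0 \in S$, one has
\[ \tilde\gamma(x) \le x_0 \quad \forall x \in S,\ \forall \gamma \in F_1, \]
\[ \tilde\gamma(x) \ge x_0 \quad \forall x \in S,\ \forall \gamma \in F_2. \]

Also, assume that the set $H_+ = \{\gamma \in \Gamma^N : \tilde\gamma \text{ is monotone increasing}\} \in \Sigma^{\otimes N}$.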
Theorem 9.2. Let $\{\alpha_n : n \ge 1\}$ be a sequence of i.i.d. measurable mono-
tone maps with a common distribution $Q$. Assume (A.1) and (H') hold.
Then there exists a unique invariant probability measure $\pi$ for the Markov
process (2.1) and
\[ \sup_{x \in S} d_1\big(p^{(n)}(x, \cdot), \pi\big) \le (1-\delta)^{[n/N]} \qquad (n \ge 1), \tag{4.9} \]
where $\delta = \min\{\delta_1, \delta_2\}$, and $[y]$ is the integer part of $y$.


Proof. The proof uses Lemma 9.1 and is spelled out in [9]. As in the
case of Theorem 9.1, we prove:
Step 1. $T^{*N}$ is a uniformly strict contraction on $(\mathcal{P}(S), d_1)$.

In other words,
\[ d_1(T^{*N}\mu, T^{*N}\nu) \le (1-\delta)\, d_1(\mu, \nu), \qquad \forall \mu, \nu \in \mathcal{P}(S). \tag{4.10} \]
Step 2. Apply the Contraction Mapping Theorem. □


For earlier related results see [4]. 


9.4.2. An Extension and Some Applications 


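An extension of Theorems 9.1-9.2 [proved in [6]] is useful for applications.
Recall that $\mathcal{S}$ is the Borel $\sigma$-field of the state space $S$. Let $\mathcal{A} \subset \mathcal{S}$, define
\[ d(\mu, \nu) := \sup_{A \in \mathcal{A}} |\mu(A) - \nu(A)| \qquad (\mu, \nu \in \mathcal{P}(S)). \tag{4.11} \]
Consider the following hypothesis $(H_1)$:
(1) $(\mathcal{P}(S), d)$ is a complete metric space; (4.12)
(2) there exists a positive integer $N$ such that for all $\gamma \in \Gamma^N$, one has
\[ d(\mu\tilde\gamma^{-1}, \nu\tilde\gamma^{-1}) \le d(\mu, \nu) \qquad (\mu, \nu \in \mathcal{P}(S)); \tag{4.13} \]
(3) there exists $\delta > 0$ such that $\forall A \in \mathcal{A}$, and with $N$ as in (2), one has
\[ P\big(\tilde\alpha^{-1}(A) = S \text{ or } \emptyset\big) \ge \delta > 0. \tag{4.14} \]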


Theorem 9.3. Assume the hypothesis $(H_1)$. Then there exists a unique
invariant probability $\pi$ for the Markov process $X_n := \alpha_n \cdots \alpha_1 X_0$, where
$X_0$ is independent of $\{\alpha_n : n \ge 1\}$. Also, one has
\[ d(T^{*n}\mu, \pi) \le (1-\delta)^{[n/N]} \qquad (\mu \in \mathcal{P}(S)) \tag{4.15} \]
where $T^{*n}\mu$ is the distribution of $X_n$ when $X_0$ has distribution $\mu$, and $[n/N]$
is the integer part of $n/N$.

Remark 4.4. For applications of Theorem 9.3 to derive a Doeblin-type
convergence theorem, and to the study of non-linear autoregressive pro-
cesses see [9].


9.4.3. Extinction and Growth 


Some light has been thrown on the possibilities of growth and extinction. 
To review these results (see [13] for proofs and other related results), let 
us assume that $S = [0,\infty)$, and $\Gamma$ consists of a family of maps $f : S \to S$
satisfying


C.1 $f(x)$ is continuously differentiable and strictly increasing on $[0, \infty)$


C.2 $\frac{d}{dx}\left[x^{-1} f(x)\right] < 0$ for $x > 0$ (concavity)


C.3 There is some $K > 0$ such that $f(K) < K$ for all $f \in \Gamma$ (note that $K$
is independent of $f$)


Then we have the following: 
Theorem 9.4. Suppose $0 < X_0 < K$ with probability one. Then:


(a) $X_n$ converges in distribution to a stationary distribution;

(b) The stationary distribution is independent of $X_0$ and its d.f. has $F(0+) =
0$ or $1$ [$F(0+) = 1$ means that $X_n \to 0$, which is extinction of the
population].


It is often useful to study the non-linear stochastic difference equation
written informally as:
\[ X_{n+1} = f(X_n, \theta_{n+1}) \]
where $(\theta_n)$ is a sequence of independent, identically distributed random
variables taking values in a (nonempty) finite set $A \subset \mathbb{R}_{++}$. Here $f :
\mathbb{R}_+ \times A \to \mathbb{R}_+$ satisfies, for each $\theta \in A$, the conditions (C.1)-(C.2). Write
$R(x, \theta) = x^{-1} f(x, \theta)$ for $x > 0$.

For each $\theta \in A$, let
\[ R(0, \theta) = \lim_{x \to 0+} R(x, \theta) \]
and
\[ R(\infty, \theta) = \lim_{x \to \infty} R(x, \theta) = f'(z, \theta). \]
Define the growth rates
\[ \nu_0 = E[\log R(0, \theta)] \]
and
\[ \nu_\infty = E[\log R(\infty, \theta)]. \]
By C.2, $\nu_0$ and $\nu_\infty$ are well-defined.


Theorem 9.5. Under assumptions C.1-C.2 and $0 < X_0 < \infty$ with proba-
bility one,


(a) if $\nu_0 < 0$, $X_n \to 0$ with probability one;

(b) if $\nu_\infty > 0$, $X_n \to \infty$ with probability one;

(c) if $\nu_0 > 0$, $\nu_\infty < 0$, $X_n$ converges weakly (independently of the distribu-
tion of $X_0$) to a distribution with support in $(0,\infty)$.
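
To make the trichotomy of Theorem 9.5 concrete, the following sketch computes the two growth rates for a finite shock distribution and reports which case applies. The law of motion $f(x,\theta) = \theta x/(1+x) + x/2$ is a hypothetical example chosen here (it satisfies C.1 and C.2, with $R(0,\theta) = \theta + 1/2$ and $R(\infty,\theta) = 1/2$); the function names are likewise illustrative.

```python
import math


def growth_rates(thetas, probs, R0, Rinf):
    """Return (nu_0, nu_inf) = (E log R(0, theta), E log R(inf, theta))."""
    nu0 = sum(p * math.log(R0(t)) for t, p in zip(thetas, probs))
    nuinf = sum(p * math.log(Rinf(t)) for t, p in zip(thetas, probs))
    return nu0, nuinf


def classify(nu0, nuinf):
    """Map the signs of the growth rates to the cases of Theorem 9.5."""
    if nu0 < 0:
        return "extinction"
    if nuinf > 0:
        return "unbounded growth"
    if nu0 > 0 and nuinf < 0:
        return "stable in distribution on (0, inf)"
    return "boundary case (not covered)"


# Hypothetical law of motion f(x, theta) = theta*x/(1+x) + x/2, so that
# R(x, theta) = theta/(1+x) + 1/2 is strictly decreasing in x.
def R0(theta):
    return theta + 0.5


def Rinf(theta):
    return 0.5


if __name__ == "__main__":
    for thetas in ([0.2, 2.0], [0.1, 0.3]):
        nu0, nuinf = growth_rates(thetas, [0.5, 0.5], R0, Rinf)
        print(thetas, round(nu0, 4), round(nuinf, 4), classify(nu0, nuinf))
```

With equally likely shocks in $\{0.2, 2.0\}$ the rate $\nu_0 = \frac12 \log 1.75$ is positive while $\nu_\infty = \log \frac12$ is negative, which is case (c); with shocks in $\{0.1, 0.3\}$ the rate $\nu_0$ is negative, which is case (a).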


9.5. Invariant Distributions: Computation and Estimation 


The problem of deriving analytical properties of invariant distributions has 
turned out to be difficult and elusive. In this section we provide an example 
of a class of Markov processes in which the unique invariant distribution 
can be completely identified. 

Let $Z_1, Z_2, \ldots$ be a sequence of non-negative i.i.d. random variables.
Consider the Markov chain $\{X_n : n = 0,1,2,\ldots\}$ on the state space $S =
\mathbb{R}_{++}$ defined by
\[ X_{n+1} = Z_{n+1} + X_n^{-1}, \qquad n \ge 0, \]
where $X_0$ is a strictly positive random variable independent of the sequence
$\{Z_i\}$. We first summarize the dynamic behavior of the sequence $\{X_n\}$.


Theorem 9.6. Assume that $\{Z_i\}$ are non-degenerate. Then the Markov
chain $\{X_n, n = 0,1,\ldots\}$ on $S = \mathbb{R}_{++}$ has a unique invariant probability $\pi$, and
$d_K(T^{*n}\mu, \pi)$ converges to zero exponentially fast, irrespective of the initial
distribution $\mu$, and the invariant probability $\pi$ is non-atomic.


Proof. The main step in the proof is to represent $X_n$ as
\[ X_n = \alpha_n \alpha_{n-1} \cdots \alpha_1 (X_0) \]
where $\alpha_n(x) = Z_n + 1/x$, $n \ge 1$. The maps $\alpha_n$ are monotone decreasing
on $S$. The splitting condition can also be verified (see [15], Theorem 4.1).
Hence Theorem 9.1 can be applied directly. □
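Suppose that the common distribution of $Z_i$ is a Gamma distribution.
Recall that the Gamma distribution with parameters $\lambda > 0$ and $a > 0$ is
the distribution on $\mathbb{R}_{++}$ given by the density function
\[
\gamma_{\lambda,a}(z) = \begin{cases}
\dfrac{a^{\lambda}}{\Gamma(\lambda)}\, z^{\lambda-1} e^{-az} & \text{if } z \in \mathbb{R}_{++}, \\
0 & \text{otherwise,}
\end{cases}
\]
where $\Gamma(\cdot)$ is the gamma function:
\[ \Gamma(\lambda) = \int_0^\infty x^{\lambda-1} e^{-x}\,dx. \]


Theorem 9.7. Suppose that the common distribution of the i.i.d. sequence
$\{Z_i\}$ is a Gamma distribution with parameters $\lambda$ and $a$. Then the invariant
probability $\pi$ on $(0,\infty)$ is absolutely continuous with density function
\[ g_{\lambda,a}(x) = \big(2K_\lambda(2a)\big)^{-1} x^{\lambda-1} e^{-a(x + \frac{1}{x})}, \qquad x \in \mathbb{R}_{++}, \]
where $K_\lambda(\cdot)$ denotes the Bessel function, i.e.,
$K_\lambda(z) = \frac12 \int_0^\infty x^{\lambda-1} e^{-\frac{z}{2}(x + x^{-1})}\,dx$.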
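
A quick numerical sanity check of Theorem 9.7 is possible with the standard library alone. If $X$ has the invariant law and $Z$ is an independent Gamma variable, then $Z + 1/X$ has the same law as $X$, so $E[X] - E[1/X] = E[Z] = \lambda/a$. The sketch below verifies this identity by numerical integration of the density and compares $E[X]$ with a simulated path of the chain. It assumes the Gamma law has shape $\lambda$ and rate $a$; the helper names, step sizes and seed are illustrative choices.

```python
import math
import random


def gig_integral(p, a, T=12.0, h=0.01):
    """Approximate the integral of x**(p-1) * exp(-a*(x + 1/x)) over (0, inf).

    Substituting x = exp(t) turns the integrand into exp(p*t - 2*a*cosh(t)),
    which is integrated with the trapezoidal rule on [-T, T].
    """
    n = int(round(2 * T / h))
    total = 0.0
    for i in range(n + 1):
        t = -T + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(p * t - 2.0 * a * math.cosh(t))
    return total * h


def stationary_means(lam, a):
    """Return (E[X], E[1/X]) under the density of Theorem 9.7."""
    norm = gig_integral(lam, a)  # equals 2 * K_lam(2a)
    return gig_integral(lam + 1, a) / norm, gig_integral(lam - 1, a) / norm


def simulate_mean(lam, a, n_steps=50000, burn_in=1000, seed=0):
    """Sample mean of X_{n+1} = Z_{n+1} + 1/X_n with Z ~ Gamma(shape lam, rate a)."""
    rng = random.Random(seed)
    x, total = 1.0, 0.0
    for step in range(burn_in + n_steps):
        # random.gammavariate takes (shape, scale); the scale is 1/a.
        x = rng.gammavariate(lam, 1.0 / a) + 1.0 / x
        if step >= burn_in:
            total += x
    return total / n_steps


if __name__ == "__main__":
    mean_x, mean_inv = stationary_means(2.0, 1.0)
    print(mean_x - mean_inv)          # should be close to lam / a = 2
    print(mean_x, simulate_mean(2.0, 1.0))
```

The first-moment identity is only a necessary condition for invariance, so this is a consistency check on the stated density rather than a proof.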


Another interesting example corresponds to Bernoulli $Z_i$: $P(Z_i = 0) =
p$, $P(Z_i = 1) = 1-p$ $(0 < p < 1)$. In this case the unique invariant
distribution $\pi$ is singular with respect to Lebesgue measure, and has full
support on $S = (0,\infty)$. An explicit computation of the distribution function
of $\pi$, involving the classical continued fraction expansion of the argument,
may be found in [15], Theorem 5.2.


9.5.1. An Estimation Problem 


Consider a Markov chain $X_n$ with a unique stationary distribution $\pi$. Some
of the celebrated results on ergodicity and the strong law of large numbers
hold for $\pi$-almost every initial condition. However, even with $[0,1]$ as the
state space, the invariant distribution $\pi$ may be hard to compute explic-
itly when the laws of motion are allowed to be non-linear, and its support
may be difficult to characterize or may be a set of zero Lebesgue measure.
Moreover, in many economic models, the initial condition may be histori-
cally given, and there may be little justification in assuming that it belongs
to the support of $\pi$.

Consider, then, a random dynamical system with state space $S = [c, d]$
(without loss of generality for what follows choose $c > 0$). Assume $\Gamma$
consists of a family of monotone maps from $S$ into $S$, and the splitting
condition (H) holds. The process starts with a given $x$. There is, by The-
orem 9.1, a unique invariant distribution $\pi$ of the random dynamical sys-
tem, and (4.4) holds. Suppose we want to estimate the equilibrium mean
$\int_S y\,\pi(dy)$ by sample means $\frac1n \sum_{j=0}^{n-1} X_j$. We say that the estimator
$\frac1n \sum_{j=0}^{n-1} X_j$ is $\sqrt{n}$-consistent if
\[ \frac1n \sum_{j=0}^{n-1} X_j = \int y\,\pi(dy) + O_p(n^{-1/2}) \tag{5.1} \]
where $O_p(n^{-1/2})$ is a random sequence $\varepsilon_n$ such that $|\varepsilon_n \cdot n^{1/2}|$ is bounded in
probability. Thus, if the estimator is $\sqrt{n}$-consistent, the fluctuations of the
empirical (or sample-) mean around the equilibrium mean are $O_p(n^{-1/2})$. We
can establish (5.1) by using (4.4). One can show that (see [7], pp. 217-219)
if
\[ f(z) = z - \int y\,\pi(dy) \]
then
\[ \sup_z \sum_{n=m+1}^{\infty} |T^n f(z)| \le (d-c) \sum_{n=m+1}^{\infty} (1-\delta)^{[n/N]} \to 0 \quad \text{as } m \to \infty. \]
Hence, $g = -\sum_{n=0}^{\infty} T^n f$ [where $T^0$ is the identity operator $I$] is well-defined,
and $g$ and $Tg$ are bounded functions. Also, $(T - I)g = -\sum_{n=1}^{\infty} T^n f +
\sum_{n=0}^{\infty} T^n f = f$. Hence,
\[
\begin{aligned}
\sum_{j=0}^{n-1} f(X_j) &= \sum_{j=0}^{n-1} (T - I)g(X_j) \\
&= \sum_{j=0}^{n-1} \big((Tg)(X_j) - g(X_j)\big) \\
&= \sum_{j=1}^{n} \big[(Tg)(X_{j-1}) - g(X_j)\big] + g(X_n) - g(X_0).
\end{aligned}
\]


By the Markov property and the definition of $Tg$ it follows that
\[ E\big((Tg)(X_{j-1}) - g(X_j) \mid \mathcal{F}_{j-1}\big) = 0 \]
where $\mathcal{F}_r$ is the $\sigma$-field generated by $\{X_j : 0 \le j \le r\}$. Hence, $(Tg)(X_{j-1}) -
g(X_j)$ $(j \ge 1)$ is a martingale difference sequence, and its terms are uncorrelated, so
that
\[ E\Big[\sum_{j=1}^{n} \big((Tg)(X_{j-1}) - g(X_j)\big)\Big]^2 = \sum_{j=1}^{n} E\big((Tg)(X_{j-1}) - g(X_j)\big)^2. \tag{5.2} \]
Given the boundedness of $g$ and $Tg$, the right side is bounded by $n \cdot \alpha$ for
some constant $\alpha$. It follows that
\[ \frac1n\, E\Big(\sum_{j=0}^{n-1} f(X_j)\Big)^2 \le \eta \quad \text{for all } n \]
where $\eta$ is a constant that does not depend on $X_0$. Thus,
\[ E\Big(\frac1n \sum_{j=0}^{n-1} X_j - \int y\,\pi(dy)\Big)^2 \le \eta/n \]
which implies
\[ \frac1n \sum_{j=0}^{n-1} X_j = \int y\,\pi(dy) + O_p(n^{-1/2}). \]


For other examples of $\sqrt{n}$-consistent estimation, see [2] [and [8],
Chapter 5].


9.6. Growth Under Uncertainty 


9.6.1. A Stochastic Stability Theorem in a Descriptive Model


Models of descriptive as well as optimal growth under uncertainty have led 
to random dynamical systems that are stable in distribution. We look at a 
“canonical” example and show how Theorem 9.1 can be applied. We begin 
with a descriptive growth model and follow it up with an optimization 
problem. 

As a matter of notation, for any function $h$ on $S$ into $S$, we write $h^{(n)}$
for the $n$th iterate of $h$. Think of '$x$' as per capita output of an economy.

Let $S = \mathbb{R}_+$ and $\Gamma = \{F_1, F_2, \ldots, F_i, \ldots, F_N\}$ where the distinct laws
of motion $F_i$ satisfy:

F.1. $F_i$ is strictly increasing, continuous, and there is some $r_i > 0$
such that $F_i(x) > x$ on $(0, r_i)$ and $F_i(x) < x$ for $x > r_i$.

Note that $F_i(r_i) = r_i$ for all $i = 1, \ldots, N$. Next, assume:

F.2. $r_i \ne r_j$ for $i \ne j$.

In other words, the unique positive fixed points $r_i$ of distinct laws of
motion are all distinct. We choose the indices $i = 1, 2, \ldots, N$ so that
\[ r_1 < r_2 < \cdots < r_N. \]

Let $\mathrm{Prob}(\alpha_n = F_i) = p_i > 0$ $(1 \le i \le N)$.

Consider the Markov process $\{X_n(x)\}$ with the state space $(0,\infty)$. If
$y \ge r_1$, then $F_i(y) \ge F_i(r_1) > r_1$ for $i = 2, \ldots, N$, and $F_1(r_1) = r_1$, so
that $X_n(x) \ge r_1$ for all $n \ge 0$ if $x \ge r_1$. Similarly, if $y \le r_N$, then
$F_i(y) \le F_i(r_N) < r_N$ for $i = 1, \ldots, N-1$ and $F_N(r_N) = r_N$, so that
$X_n(x) \le r_N$ for all $n \ge 0$ if $x \le r_N$. Hence, if the initial state $x$ is in
$[r_1, r_N]$, then the process $\{X_n(x) : n \ge 0\}$ remains in $[r_1, r_N]$ forever. We
shall presently see that for a long run analysis we can consider $[r_1, r_N]$ as
the effective state space.

We shall first indicate that on the state space $[r_1, r_N]$ the splitting con-
dition (H) is satisfied. If $x \ge r_1$, $F_1(x) \le x$, $F_1^{(2)}(x) \le F_1(x)$, etc. The
limit of this decreasing sequence $F_1^{(n)}(x)$ must be a fixed point of $F_1$, and
therefore must be $r_1$. Similarly, if $x \le r_N$, then $F_N^{(n)}(x)$ increases to $r_N$. In
particular,
\[ \lim_{n\to\infty} F_1^{(n)}(r_N) = r_1, \qquad \lim_{n\to\infty} F_N^{(n)}(r_1) = r_N. \]
Thus, there must be a positive integer $n_0$ such that
\[ F_1^{(n_0)}(r_N) < F_N^{(n_0)}(r_1). \]
This means that if $z_0 \in [F_1^{(n_0)}(r_N), F_N^{(n_0)}(r_1)]$, then
\[
\begin{aligned}
\mathrm{Prob}\big(X_{n_0}(x) \le z_0\ \forall x \in [r_1, r_N]\big) &\ge \mathrm{Prob}(\alpha_n = F_1 \text{ for } 1 \le n \le n_0) = p_1^{n_0} > 0, \\
\mathrm{Prob}\big(X_{n_0}(x) \ge z_0\ \forall x \in [r_1, r_N]\big) &\ge \mathrm{Prob}(\alpha_n = F_N \text{ for } 1 \le n \le n_0) = p_N^{n_0} > 0.
\end{aligned}
\]

Hence, considering $[r_1, r_N]$ as the state space, and using Theorem 9.1, there
is a unique invariant probability $\pi$ with the stability property holding for
all initial $x \in [r_1, r_N]$.
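
The search for the splitting horizon $n_0$ in the argument above is easy to mechanize. The sketch below uses the hypothetical laws of motion $F(x) = \sqrt{r x}$, which are continuous and strictly increasing with $F(x) > x$ on $(0, r)$ and $F(x) < x$ for $x > r$, so they satisfy F.1; the function names and the particular fixed points $r_1 = 1$, $r_N = 4$ are assumptions made for illustration.

```python
import math


def make_law(r):
    """Hypothetical law of motion F(x) = sqrt(r * x) with positive fixed point r."""
    return lambda x: math.sqrt(r * x)


def splitting_horizon(F_low, F_high, r_low, r_high, max_iter=1000):
    """Smallest n with F_low^(n)(r_high) < F_high^(n)(r_low), plus both iterates.

    F_low has the smallest fixed point r_low and F_high the largest, r_high.
    """
    down, up = r_high, r_low
    for n in range(1, max_iter + 1):
        down, up = F_low(down), F_high(up)
        if down < up:
            return n, down, up
    raise ValueError("no splitting horizon found within max_iter steps")


if __name__ == "__main__":
    n0, down, up = splitting_horizon(make_law(1.0), make_law(4.0), 1.0, 4.0)
    # Any z0 in [down, up] can serve as the splitting point.
    print(n0, down, up)
```

For these two maps the iterates meet at 2 after one step and cross after two, so $n_0 = 2$ and any $z_0 \in [\sqrt{2}, \sqrt{8}]$ works, with the two events having probabilities at least $p_1^{2}$ and $p_N^{2}$.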

Now, define $m(x) = \min_{i=1,\ldots,N} F_i(x)$, and fix the initial state $x \in (0, r_1)$.

One can verify that (i) $m$ is continuous; (ii) $m$ is strictly increasing;
(iii) $m(r_1) = r_1$ and $m(x) > x$ for $x \in (0, r_1)$, and $m(x) < x$ for $x > r_1$.
Clearly $m^{(n)}(x)$ increases with $n$, and $m^{(n)}(x) \le r_1$. The limit of the
sequence $m^{(n)}(x)$ must be a fixed point, and is, therefore, $r_1$. Since $F_i(r_1) >
r_1$ for $i = 2, \ldots, N$, there exists some $\varepsilon > 0$ such that $F_i(y) > r_1$ $(2 \le i \le N)$
for all $y \in [r_1 - \varepsilon, r_1]$. Clearly there is some $n_\varepsilon$ such that $m^{(n_\varepsilon)}(x) \ge r_1 - \varepsilon$. If
$\tau_1 = \inf\{n \ge 1 : X_n(x) > r_1\}$ then it follows that for all $k \ge 1$
\[ \mathrm{Prob}(\tau_1 > n_\varepsilon + k) \le p_1^{k}. \]
Since $p_1^{k}$ goes to zero as $k \to \infty$, it follows that $\tau_1$ is finite almost surely.
Also, $X_{\tau_1}(x) \le r_N$, since for $y \le r_1$, (i) $F_i(y) < F_i(r_N)$ for all $i$ and (ii)
$F_i(r_N) < r_N$ for $i = 1, 2, \ldots, N-1$ and $F_N(r_N) = r_N$. (In a single period
it is not possible to go from a state less than $r_1$ to one larger than $r_N$.) By
the strong Markov property, and our earlier result, $X_{\tau_1 + m}(x)$ converges in
distribution to $\pi$ as $m \to \infty$ for all $x \in (0, r_1)$. Similarly, one can check that
as $n \to \infty$, $X_n(x)$ converges in distribution to $\pi$ for all $x > r_N$.


Note that in growth models, the condition F.1 is often derived from
appropriate “end point” or Uzawa-Inada conditions. It should perhaps be
stressed that convexity assumptions have not appeared in the discussion of
this section so far. Of course, in models of optimization, $F_i$ is the optimal
transition of the system from one state into another, and non-convexity
may lead to a failure of the splitting condition (see [19] for details).


9.6.2. One Sector Log-Cobb-Douglas Optimal Growth 


Let us recall the formulation of the one-sector growth model with a Cobb-
Douglas production function $G(x) = x^{a}$, $0 < a < 1$, with a representative
decision maker’s utility given by $u(c) = \ln c$. Following [21], suppose that an
exogenous perturbation may reduce production by some parameter $0 < k < 1$
with probability $p > 0$ (the same for all $t = 0, 1, \ldots$). This independent
and identically distributed random shock enters multiplicatively into the
production process so that output is given by $G(x) = r x^{a}$ where $r \in \{k, 1\}$.
The dynamic optimization problem can be explicitly written as follows:
\[ \max\ E_0 \sum_{t=0}^{\infty} \beta^{t} \ln c_t \]
where $0 < \beta < 1$ is the discount factor, and the maximization is over all
consumption plans $c = (c_0, c_1, \ldots)$ such that for $t = 0, 1, 2, \ldots$
\[ c_t = r_t x_t^{a} - x_{t+1}, \qquad c_t \ge 0, \quad x_t \ge 0, \]
and $x_0$, $r_0$ are given.
It is well known that the optimal transition of $x_t$ is described by
$g(x, r) = a\beta r x^{a}$, i.e., the plan $x_t$ generated recursively by
\[ x_{t+1} = g(x_t, r_t) = a\beta r_t x_t^{a} \]
is optimal.
Consider now the random dynamical system obtained by the following
logarithmic transformation of $x_t$:
\[ y_t = -\frac{1-a}{\ln k}\,\ln x_t + 1 + \frac{\ln a + \ln \beta}{\ln k}. \]
The new variable $y_t$ associated with $x_t$ evolves according to a linear policy,
so that
\[ y_{t+1} = a y_t + (1-a)\Big(1 - \frac{\ln r_t}{\ln k}\Big), \]
which can be rewritten as
\[
\begin{aligned}
y_{t+1} &= a y_t && \text{with probability } p, \\
y_{t+1} &= a y_t + (1-a) && \text{with probability } 1-p.
\end{aligned}
\]

Define the maps $\gamma_0$, $\gamma_1$ from $[0,1]$ to $[0,1]$ by
\[
\begin{aligned}
\gamma_0(y) &= a y, \\
\gamma_1(y) &= a y + (1-a).
\end{aligned} \tag{6.1}
\]

It is useful to note here that the map $\gamma_0$ corresponds to the case where
the shock, $r$, takes the value $k$; and the map $\gamma_1$ corresponds to the case
where the shock, $r$, takes the value 1. Denote $(p, 1-p)$ by $(p_0, p_1)$. Then
$S = [0,1]$, $\Gamma = \{\gamma_0, \gamma_1\}$, together with $Q = \{p_0, p_1\}$ is a random dynamical
system. The maps $\gamma_i$, for $i \in \{0,1\}$, are clearly affine.
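
The change of variable can be checked numerically by running the optimal policy in the original variable alongside the affine recursion and comparing the two after every step. The sketch assumes the transformation $y = -\frac{1-a}{\ln k}\ln x + 1 + \frac{\ln(a\beta)}{\ln k}$; the parameter values, seed and function names are illustrative.

```python
import math
import random


def to_y(x, a, beta, k):
    """Logarithmic change of variable from capital x to the normalized state y."""
    return (-(1.0 - a) / math.log(k) * math.log(x)
            + 1.0 + math.log(a * beta) / math.log(k))


def max_transform_error(a, beta, k, p, x0, n_steps=200, seed=0):
    """Largest gap between the transformed x-path and the affine y-recursion."""
    rng = random.Random(seed)
    x, y, worst = x0, to_y(x0, a, beta, k), 0.0
    for _ in range(n_steps):
        shock_is_k = rng.random() < p          # r = k with probability p, else r = 1
        r = k if shock_is_k else 1.0
        x = a * beta * r * x ** a              # optimal policy in the original variable
        # Affine maps: y -> a*y when r = k, y -> a*y + (1 - a) when r = 1.
        y = a * y if shock_is_k else a * y + (1.0 - a)
        worst = max(worst, abs(to_y(x, a, beta, k) - y))
    return worst


if __name__ == "__main__":
    print(max_transform_error(a=0.4, beta=0.95, k=0.5, p=0.3, x0=0.2))
```

The reported gap is at the level of floating-point rounding, consistent with the two recursions describing the same process in different coordinates.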


9.6.2.1. The Support of the Invariant Distribution 


Let $\pi$ be the unique invariant distribution, $F_\pi$ its distribution function.
The graphs of the functions show that for $0 < a < 1/2$, the image sets
of the two functions $\gamma_0$ and $\gamma_1$ are disjoint, a situation which can be de-
scribed as the “non-overlapping” case. In this case, the “gap” between the
two image sets (in the unit interval) will “spread” through the unit inter-
val by successive applications of the maps (6.1). Thus, one would expect
the support of the invariant distribution to be “thin” (with zero Lebesgue
measure).
On the other hand, for $1/2 \le a < 1$, the image sets of the functions
$\gamma_0$ and $\gamma_1$ have a non-empty intersection. We can refer to this as the
“overlapping” case. Here, the successive iterations of the overlap can be
expected to “fill up” the unit interval, so the invariant distribution should
have full support.

The above heuristics are actually seen to be valid. 
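
The heuristic can be illustrated by computing, in exact arithmetic, the total length of the union of the $2^n$ images of $[0,1]$ under all $n$-fold compositions of the two maps $y \mapsto ay$ and $y \mapsto ay + (1-a)$. This is only a sketch: the function names are illustrative, and covering the support by these level-$n$ intervals is the standard iterated-function-system picture rather than something stated in the text.

```python
from fractions import Fraction


def level_intervals(a, n):
    """Images of [0, 1] under all n-fold compositions of the two affine maps."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        intervals = [(a * lo + shift, a * hi + shift)
                     for lo, hi in intervals
                     for shift in (Fraction(0), 1 - a)]
    return intervals


def covered_length(intervals):
    """Total length of the union of the given closed intervals."""
    total = Fraction(0)
    current_lo = current_hi = None
    for lo, hi in sorted(intervals):
        if current_hi is None or lo > current_hi:
            if current_hi is not None:
                total += current_hi - current_lo
            current_lo, current_hi = lo, hi
        else:
            current_hi = max(current_hi, hi)
    if current_hi is not None:
        total += current_hi - current_lo
    return total


if __name__ == "__main__":
    for a in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)):
        print(a, [float(covered_length(level_intervals(a, n))) for n in (1, 3, 5)])
```

For $a = 1/3$ the covered length is $(2/3)^n$, which shrinks to zero, while for $a = 1/2$ and $a = 2/3$ the union is the whole unit interval at every level.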

It is important to remark that this result does not depend on the mag-
nitude of the discount factor $\beta$ nor on the amplitude of the shock $k$, but
only on the technological parameter $a$. The discount factor $\beta$ only shifts
the support of the invariant distribution of the original model over the real
line, while the exogenous shock $k$ affects its amplitude. A stream of re-
search has centered on the fundamental question of deciding for
what values of $a$ the invariant $F_\pi$ is absolutely continuous, and for what
values of $a$ it is singular. For an exhaustive mathematical survey of the
whole history of Bernoulli convolutions, see [24]. It is known, in the sym-
metric case $p = \frac12$, that the distribution function is “pure”; that is, it is
either absolutely continuous or it is singular (Jessen and Wintner [1935]).
Further, Kershner and Wintner [1935] have shown that if $0 < a < 1/2$,
the support of the distribution function is a Lebesgue-null Cantor set and,
therefore, the distribution function is singular. For $a = \frac12$ one gets the
uniform distribution, which is not singular.

For the symmetric case $p = \frac12$, denote by $S_\perp$ the set of $a \in (1/2, 1)$
such that $F_\pi$ is singular. It was conjectured that the distribution function
should be absolutely continuous with respect to Lebesgue measure when
$1/2 < a < 1$. Wintner [1935] showed that if $a$ is of the form $(1/2)^{1/k}$ where
$k \in \{1, 2, 3, \ldots\}$, then the distribution function is absolutely continuous.
However, in the other direction, Erdős [1939] showed that when $a$ is the
positive solution of the equation $a^2 + a - 1 = 0$, so that $a = (\sqrt{5} - 1)/2$,
then $a \in S_\perp$.

Erdős also showed that $S_\perp \cap (\varepsilon, 1)$ has zero Lebesgue measure for some
$\varepsilon < 1$, so that absolute continuity of the invariant distribution obtains for
(almost every) $a$ sufficiently close to 1. A conjecture that emerged from
these findings is that the set $S_\perp$ itself should have Lebesgue measure zero.
In their brief discussion of this problem, [12] state that deciding whether
the invariant distribution is singular or absolutely continuous for $a > 1/2$
is a “famous open question”.

Solomyak [1995] made a real breakthrough when he showed that $S_\perp$ has
zero Lebesgue measure. More precisely, he established that for almost every
$a \in (1/2, 1)$, the distribution has density in $L^2(\mathbb{R})$ and for almost every
$a \in (2^{-1/2}, 1)$ the density is bounded and continuous. A simpler proof of
the same result was subsequently presented by [25].

More recent contributions to this literature deal with the asymmetric
case $p \ne 1/2$ (see, for example, [26]). For example, $F_\pi$ is singular for
values of parameters $(a, p)$ such that $0 < a < p^{p}(1-p)^{(1-p)}$, while $F_\pi$ is
absolutely continuous for almost every $p^{p}(1-p)^{(1-p)} < a < 1$ whenever
$1/3 \le p \le 2/3$. For more details see [21].
10.1. Introduction


In this lecture we shall present a brief account of some of the interesting 
developments in recent years on quantum information theory. The text 
is divided into three sections. The first section is devoted to fundamental 
concepts like events, observables, states, measurements and transformations 
of states in finite level quantum systems. The second section gives a quick
account of the theory of error correcting quantum codes. The last section
deals with the theory of testing quantum hypotheses and the quantum
version of Shannon’s coding theorem for classical-quantum channels. I have
liberally used the books by Nielsen and Chuang ([12]), Hayashi ([6]) and
also [21], [23]. The emphasis is on the description of results and giving a
broad perspective. For proofs and detailed bibliography we refer to [12], [6].


10.2. Elements of Finite Dimensional Quantum Probability 


The first step in entering the territory of quantum information theory is an 
acquaintance with quantum probability where the fundamental notions of 
events, random variables, probability distributions and measurements are 
formulated in the language of operators in a Hilbert space. In the present 
exposition all the Hilbert spaces will be complex and finite dimensional and 
their scalar products will be expressed in the Dirac notation $\langle \cdot | \cdot \rangle$. For any
Hilbert space $H$, denote by $B(H)$ the *-algebra of all operators on $H$ where,
for any operator $A$, its adjoint will be denoted by $A^\dagger$. We write
\[ \langle u|A|v\rangle = \langle u|Av\rangle = \langle A^\dagger u|v\rangle \qquad \forall\, u, v \in H,\ A \in B(H). \]

$B(H)$ itself will also be viewed as a Hilbert space, in its own right, with
the scalar product
\[ \langle A|B\rangle = \mathrm{Tr}\, A^\dagger B \qquad \forall\, A, B \in B(H). \]

In the real linear subspace of all hermitian operators we write $A \ge B$ if
$A - B$ is nonnegative definite, i.e., $A - B \ge 0$. It is clear that $\ge$ is a partial
ordering. By a projection we shall always mean an orthogonal projection
operator. Denote by $O(H)$ and $P(H)$ respectively the space of all hermitian
and projection operators. Any element $A$ in $O(H)$ is called an observable
and any element $E$ in $P(H)$ an event. The elements $0$ and $I$ in $P(H)$ are
called the null and certain events. Clearly, $0 \le E \le I$ for any $E$ in $P(H)$.
If $E$ is an event, $I - E$, denoted $E^\perp$, is the event called not $E$. For two
events $E$, $F$ let $E \vee F$ and $E \wedge F$ denote respectively the maximum and
minimum of the pair $E$, $F$ in the ordering $\le$. If $E \le F$ we say that the
event $E$ implies $F$. If $E$, $F$ are arbitrary events we interpret $E \vee F$ as $E$
or $F$ and $E \wedge F$ as $E$ and $F$. It is important to note that for three events
$E$, $E_1$, $E_2$ the event $E \wedge (E_1 \vee E_2)$ may differ from $(E \wedge E_1) \vee (E \wedge E_2)$. (In
the quantum world the operation ‘and’ need not distribute with ‘or’ but in
the logic of propositions that we prove about a quantum system the logical
operation ‘and’ does distribute with the logical operation ‘or’!)
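
A minimal worked example of this failure of distributivity can be given in $H = \mathbb{C}^2$. The particular vectors below are chosen here for illustration and do not come from the text; $e_0, e_1$ denotes the canonical orthonormal basis.

```latex
% Assumed illustration in H = C^2 with canonical basis e_0, e_1.
\[
E = |e_0\rangle\langle e_0|, \qquad
E_1 = |u_+\rangle\langle u_+|, \qquad
E_2 = |u_-\rangle\langle u_-|, \qquad
u_\pm = \tfrac{1}{\sqrt{2}}\,(e_0 \pm e_1).
\]
% The ranges of E_1 and E_2 together span C^2, so their join is the certain event:
\[
E_1 \vee E_2 = I, \qquad E \wedge (E_1 \vee E_2) = E \neq 0 .
\]
% The range of E meets the range of each E_i only in the zero vector:
\[
E \wedge E_1 = 0, \quad E \wedge E_2 = 0, \qquad
(E \wedge E_1) \vee (E \wedge E_2) = 0 \neq E .
\]
```

Here $E \wedge F$ is the projection onto the intersection of the ranges of $E$ and $F$, and $E \vee F$ is the projection onto the closed span of their union.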

For any observable $X \in O(H)$ and any real-valued function $f$ on $\mathbb{R}$,
the observable described by the hermitian operator $f(X)$ is understood as
the function $f$ of the observable $X$. If $E \subset \mathbb{R}$ and $1_E$ denotes the indicator
function of $E$ then $1_E(X)$ is a projection which is to be interpreted as the
event that the value of the observable $X$ lies in $E$. Thus, for a singleton
set $\{\lambda\} \subset \mathbb{R}$, $1_{\{\lambda\}}(X) \ne 0$ if and only if $\lambda$ is an eigenvalue of $X$ and the
corresponding eigenprojection $1_{\{\lambda\}}(X)$ is the event that $X$ assumes the
value $\lambda$. This shows that the values of the observable $X$ constitute the
spectrum of $X$, denoted by $\mathrm{spec}\, X$, and
\[ X = \sum_{\lambda \in \mathrm{spec}\, X} \lambda\, 1_{\{\lambda\}}(X) \]
is the spectral decomposition of $X$. We have
\[ P(H) \subset O(H) \subset B(H) \]
and any $E \in P(H)$ is a $\{0, 1\}$-valued observable.
Any nonnegative operator $\rho \ge 0$ of unit trace on $H$ is called a state of
the quantum system described by $H$. The set of all such states is denoted
by $S(H)$. Then $S(H)$ is a compact convex set in $B(H)$ with its extreme
points one dimensional projections. One dimensional projections are called
pure states and thanks to the spectral theorem any state can be expressed as
\[ \rho = \sum_j p_j\, |u_j\rangle\langle u_j| \tag{10.1} \]
where $\{u_j\}$ is an orthonormal set of vectors in $H$, $p_j \ge 0$ $\forall j$ and $\sum_j p_j = 1$.
In other words every state can be expressed as a convex combination of
pure states which are mutually orthogonal one dimensional projections. For
any $X \in O(H)$, $\rho \in S(H)$ the real scalar $\mathrm{Tr}\, \rho X$ is called the expectation
of the observable $X$ in the state $\rho$. If $X = E \in P(H)$ its expectation
$\mathrm{Tr}\, \rho E$ is the probability of the event $E$ in the state $\rho$. Thus ‘tracing out’ in
quantum probability is the analogue of ‘integration’ in classical probability.
Note that
\[ S(H) \subset O(H) \subset B(H). \]

In a quantum system described by the pair $(H, \rho)$ where $\rho$ is a state in
$H$, the observable $\rho$ or its equivalent $-\log \rho$ is a fundamental observable
in quantum information theory and its expectation $S(\rho) = -\mathrm{Tr}\, \rho \log \rho$ is
called the von Neumann entropy or, simply, entropy of $\rho$. Even though
$-\log \rho$ can take infinite values, $S(\rho)$ is finite.
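
For a single qubit the entropy can be computed in closed form from the two eigenvalues of the state. The sketch below is an illustration under stated assumptions: the state is the hermitian matrix with real diagonal entries `a`, `d` and off-diagonal entry `b`, natural logarithms are used, and the convention $0 \log 0 = 0$ is applied; the function names are not from the text.

```python
import math


def qubit_eigenvalues(a, d, b):
    """Eigenvalues of the hermitian 2x2 matrix [[a, b], [conj(b), d]] with real a, d."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + abs(b) ** 2)
    return mean - radius, mean + radius


def von_neumann_entropy(eigenvalues):
    """S(rho) = -sum p log p over the eigenvalues p, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in eigenvalues if p > 1e-15)


if __name__ == "__main__":
    # Maximally mixed state I/2: entropy log 2.
    print(von_neumann_entropy(qubit_eigenvalues(0.5, 0.5, 0.0)))
    # Pure state (projection onto (e_0 + e_1)/sqrt(2)): entropy 0.
    print(von_neumann_entropy(qubit_eigenvalues(0.5, 0.5, 0.5)))
```

The zero eigenvalue of the pure state is exactly where $-\log \rho$ is infinite, yet it contributes nothing to $S(\rho)$, which is why the entropy stays finite.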

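It is often useful to consider the linear functional $X \to \mathrm{Tr}\, \rho X$ on $B(H)$
for any $\rho$ in $S(H)$ and call it the expectation map in the state $\rho$. In other
words we can view expectation as a nonnegative linear functional on the
C* algebra $B(H)$ satisfying
\[ \mathrm{Tr}\, \rho X^\dagger X \ge 0 \quad \forall\, X \in B(H), \qquad \mathrm{Tr}\, \rho I = 1. \]
If $u, v \in H$ we say that $|v\rangle \in H$ and call it a ket vector whereas $\langle u| \in H^*$,
the dual of $H$, and call it a bra vector. The bra vector $\langle u|$ evaluated at the
ket vector $|v\rangle$ is the bracket $\langle u|v\rangle$, the scalar product between $u$ and $v$ in
$H$. Thus for $u$, $v$ in $H$ we can define the operator $|u\rangle\langle v|$ by
\[ |u\rangle\langle v|\, |w\rangle = \langle v|w\rangle\, |u\rangle \qquad \forall\, w \in H. \]
If $u \ne 0$, $v \ne 0$ then $|u\rangle\langle v|$ is a rank one operator with range $\mathbb{C}|u\rangle$. In
particular,
\[
\begin{aligned}
|u_1\rangle\langle u_2|\, |u_3\rangle\langle u_4| \cdots |u_{2j+1}\rangle\langle u_{2j+2}|
  &= \big(\langle u_2|u_3\rangle \langle u_4|u_5\rangle \cdots \langle u_{2j}|u_{2j+1}\rangle\big)\, |u_1\rangle\langle u_{2j+2}|, \\
(|u\rangle\langle v|)^\dagger &= |v\rangle\langle u|, \\
\mathrm{Tr}\, |u\rangle\langle v| &= \langle v|u\rangle.
\end{aligned}
\]

Equation (10.1) anticipates this notation developed by Dirac.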

A finite level quantum system $A$ with a state $\rho^A$ in the Hilbert space
$H^A$ is described by the triple $(H^A, P(H^A), \rho^A)$ where the probability of any
event $E \in P(H^A)$ is given by $\mathrm{Tr}\, \rho^A E$. If a finite classical probability space is
described by the sample space $\Omega^A = \{1, 2, \ldots, n\}$, the Boolean algebra $\mathcal{F}^A$
of all subsets of $\Omega^A$ and a probability distribution $P^A$ then the probability
space $(\Omega^A, \mathcal{F}^A, P^A)$ can also be described by the triple $(H^A, P(H^A), \rho^A)$
where $H^A = \mathbb{C}^n$ and the state operator $\rho^A$ is given by the diagonal matrix
\[
\rho^A = \begin{pmatrix}
P^A(1) & 0 & \cdots & 0 \\
0 & P^A(2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P^A(n)
\end{pmatrix}
\]
in the canonical orthonormal basis $\{e_j\}$, $e_j$ being the vector with 1 in the
$j$th position and 0 in all the other positions.

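If $A_i$, $i = 1, 2, \ldots, k$ are $k$ quantum systems and the Hilbert space $H^{A_i}$ is
the one associated with $A_i$ then the Hilbert space for the joint or composite
system $A_1 A_2 \ldots A_k$ is the tensor product:
\[ H^{A_1 A_2 \ldots A_k} = H^{A_1} \otimes H^{A_2} \otimes \cdots \otimes H^{A_k}. \]
Suppose
\[ H^{AB} = H^A \otimes H^B \]
is the Hilbert space of the composite system $AB$ consisting of two sub-
systems $A$, $B$ where $\{e_i\}$, $\{f_j\}$ are some orthonormal bases in $H^A$, $H^B$
respectively. Let $X$ be an operator on $H^{AB}$. For $u, v \in H^A$, $u', v' \in H^B$
define the sesquilinear forms
\[ \beta_X(u, v) = \sum_j \langle u \otimes f_j |X| v \otimes f_j \rangle, \qquad u, v \in H^A, \]
\[ \beta'_X(u', v') = \sum_i \langle e_i \otimes u' |X| e_i \otimes v' \rangle, \qquad u', v' \in H^B. \]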
Then there exist operators $T \in B(H^A)$, $T' \in B(H^B)$ such that
\[ \langle u|T|v\rangle = \beta_X(u, v), \qquad \langle u'|T'|v'\rangle = \beta'_X(u', v'). \]

Then $T$ and $T'$ are respectively the relative trace of $X$ over $H^B$ and $H^A$
and one writes
\[ T = \mathrm{Tr}_{H^B} X, \qquad T' = \mathrm{Tr}_{H^A} X. \]

The relative traces $T$, $T'$ are independent of the choice of the orthonor-
mal bases $\{e_i\}$, $\{f_j\}$. The linear maps $X \to \mathrm{Tr}_{H^B} X$, $X \to \mathrm{Tr}_{H^A} X$ from
$B(H^{AB})$ onto $B(H^A)$, $B(H^B)$ respectively are completely positive in the
sense that the map
\[ [X_{ij}] \to [\mathrm{Tr}_{H^B}(X_{ij})], \qquad i, j \in \{1, 2, \ldots, d\},\ X_{ij} \in B(H^{AB}) \]
from $B(H^{AB} \otimes \mathbb{C}^d)$ into $B(H^A \otimes \mathbb{C}^d)$ is positive for every $d = 1, 2, \ldots$.
Furthermore for all $Y$ in $B(H^A)$ and $X$ in $B(H^{AB})$
\[
\begin{aligned}
\mathrm{Tr}_{H^B} (Y \otimes I) X &= Y\, \mathrm{Tr}_{H^B} X, \\
\mathrm{Tr}_{H^B} X (Y \otimes I) &= (\mathrm{Tr}_{H^B} X)\, Y, \\
\mathrm{Tr}\,(\mathrm{Tr}_{H^B} X) &= \mathrm{Tr}\,(\mathrm{Tr}_{H^A} X) = \mathrm{Tr}\, X.
\end{aligned}
\]

In other words, if tracing is viewed as integration the first two relations
exhibit the properties of conditional expectation and the last one is analo-
gous to Fubini’s theorem. If $\rho^{AB}$ is a state of the composite system then
$\mathrm{Tr}_{H^B} \rho^{AB} = \rho^A$ and $\mathrm{Tr}_{H^A} \rho^{AB} = \rho^B$ are states on $H^A$ and $H^B$ respectively.
We call $\rho^A$ and $\rho^B$ the marginal states of $\rho^{AB}$.
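
The definition of the relative trace through the sesquilinear forms translates directly into index arithmetic once a product basis is fixed. The sketch below assumes operators are stored as nested lists in the basis $e_i \otimes f_j$, ordered by the index $i \cdot \dim H^B + j$; this storage convention and the function names are choices made here for illustration.

```python
def partial_trace_B(X, dim_a, dim_b):
    """Relative trace over H^B: entry (i, k) is sum_j <e_i (x) f_j| X |e_k (x) f_j>."""
    return [[sum(X[i * dim_b + j][k * dim_b + j] for j in range(dim_b))
             for k in range(dim_a)]
            for i in range(dim_a)]


def partial_trace_A(X, dim_a, dim_b):
    """Relative trace over H^A: entry (j, l) is sum_i <e_i (x) f_j| X |e_i (x) f_l>."""
    return [[sum(X[i * dim_b + j][i * dim_b + l] for i in range(dim_a))
             for l in range(dim_b)]
            for j in range(dim_b)]


# Projection onto the entangled unit vector (e_0 (x) f_0 + e_1 (x) f_1) / sqrt(2).
bell = [[0.5, 0.0, 0.0, 0.5],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.0, 0.5]]

if __name__ == "__main__":
    print(partial_trace_B(bell, 2, 2))
    print(partial_trace_A(bell, 2, 2))
```

Both marginals of this pure state equal $\frac12 I$, the maximally mixed state on $\mathbb{C}^2$.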

In the context of composite systems there arises the following natural
question. Suppose $\rho_i$ is a state in the Hilbert space $H_i$, $i = 1, 2$. Denote by
$C(\rho_1, \rho_2)$ the compact convex set of all states in $H_1 \otimes H_2$ whose $H_i$-marginal
is $\rho_i$ for each $i$. It is desirable to have a good description of the extreme
points of $C(\rho_1, \rho_2)$. As $\rho$ varies over the set of all such extreme points what
is the range of its von Neumann entropy? An answer to this question will
throw much light on the entanglement between quantum systems $(H_1, \rho_1)$
and $(H_2, \rho_2)$. When $H_i = \mathbb{C}^2$ and $\rho_i = \frac12 I$ for $i = 1, 2$, it is known that
the extreme points are all pure and hence have zero entropy. They are the
famous EPR states named after Einstein, Podolsky and Rosen. (See [5],
[20], [22].) In general, there can exist extremal states which are not pure.

We now introduce the notion of measurement for finite level quantum
systems. Measurements play an important role in formulating the problems of quan-
tum information theory and testing multiple hypotheses. Let $H$ be the
Hilbert space of a finite level system. By a measurement with values in an
abstract set $\{m_1, m_2, \ldots, m_k\}$ we mean a partition of $I$ into $k$ nonnegative
operators $M_1, M_2, \ldots, M_k$ so that $M_1 + M_2 + \cdots + M_k = I$ and in any state
$\rho$ the values $m_i$ occur with respective probabilities $\mathrm{Tr}\, \rho M_i$, $i = 1, 2, \ldots, k$.
We express such a measurement by
\[ M = \begin{pmatrix} m_1 & m_2 & \cdots & m_k \\ M_1 & M_2 & \cdots & M_k \end{pmatrix} \tag{10.2} \]
When each $M_i$ is a projection it is called a von Neumann measurement.
Without loss of generality, for the purposes of information theory we
may identify the value set $\{m_1, m_2, \ldots, m_k\}$ with $\{1, 2, \ldots, k\}$. Denote by
$\mathcal{M}_k(H)$ the compact convex set of all measurements in $H$ with value set
$\{1, 2, \ldots, k\}$. We shall now describe the extreme points of $\mathcal{M}_k(H)$ in a
concrete form. The following is a sharpening of Naimark’s theorem ([16])
for finite dimensional Hilbert spaces.


Theorem 10.1. Let $M \in \mathcal{M}_k(H)$ be a measurement given by
\[ M = \begin{pmatrix} 1 & 2 & \cdots & k \\ M_1 & M_2 & \cdots & M_k \end{pmatrix} \]
where
\[ M_i = \sum_{j=1}^{r_i} |u_{ij}\rangle\langle u_{ij}|, \qquad 1 \le i \le k, \]
$r_i > 0$ is the rank of $M_i$, $\{u_{ij},\ 1 \le j \le r_i\}$ is an orthogonal set of vectors
and $\|u_{ij}\|^2$, $1 \le j \le r_i$ are the nonzero eigenvalues of $M_i$ with multiplicity
included. Then there exists a Hilbert space $K$, a set $\{P_1, P_2, \ldots, P_k\}$ of
projections in $H \oplus K$ and vectors $v_{ij} \in K$, $1 \le j \le r_i$, $1 \le i \le k$ satisfying
the following:

(i) $\dim H \oplus K = \sum_{i=1}^{k} r_i$;

(ii) The set $\{u_{ij} \oplus v_{ij} \mid 1 \le j \le r_i,\ 1 \le i \le k\}$ is an orthonormal basis for
$H \oplus K$;

(iii) $P_i = \sum_{j=1}^{r_i} |u_{ij} \oplus v_{ij}\rangle\langle u_{ij} \oplus v_{ij}|$
and $\mathrm{rank}\, P_i = \mathrm{rank}\, M_i$ $\forall\, i$;

(iv) The triplet consisting of $K$, the family $\{v_{ij} \mid 1 \le j \le r_i,\ 1 \le i \le k\}$ of
vectors in $K$ and the spectral resolution $P_1, P_2, \ldots, P_k$ is unique up to
a Hilbert space isomorphism;

(v) $M$ is an extreme point of $\mathcal{M}_k(H)$ if and only if the set
$\{|u_{ij}\rangle\langle u_{ij'}| \mid 1 \le j \le r_i,\ 1 \le j' \le r_i,\ 1 \le i \le k\}$ of rank one operators
is linearly independent in $B(H)$. In particular, von Neumann measure-
ments are extremal.
Remark:

(1) If one of the $M_i$’s is zero, the test for extremality reduces to the case
of $\mathcal{M}_{k-1}$.

(2) If $M$ is an extremal element of $\mathcal{M}_k$ where $\mathrm{rank}\, M_i = r_i > 0$ $\forall\, 1 \le i \le k$
then $r_1^2 + r_2^2 + \cdots + r_k^2 \le (\dim H)^2$.

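(3) If $\mathcal{H} = \mathbb{C}^2$, $\omega = \exp(2\pi i/3)$ then

\[ \begin{pmatrix} 1 & 2 & 3 \\ \frac{1}{3}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} & \frac{1}{3}\begin{pmatrix} 1 & \omega \\ \bar{\omega} & 1 \end{pmatrix} & \frac{1}{3}\begin{pmatrix} 1 & \omega^2 \\ \bar{\omega}^2 & 1 \end{pmatrix} \end{pmatrix} \]

is an extremal measurement in $\mathcal{M}_3(\mathcal{H})$ but not a von Neumann measurement.

(4) It would be interesting to make a finer classification of extremal measurements
modulo permutations of the set $\{1, 2, \ldots, k\}$ and conjugation
by unitary operators in $\mathcal{H}$.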
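The criterion in part (v) of Theorem 10.1 can be checked numerically for the qubit example of Remark (3). The sketch below assumes the three operators are $M_j = \frac{1}{3}\begin{pmatrix} 1 & \omega^j \\ \bar{\omega}^j & 1 \end{pmatrix}$, $j = 0, 1, 2$ (one reading of the example), and uses helper names (`outer`, `hs_inner`, `det3`) that are ours, not the text's. Since each $r_j = 1$, extremality amounts to linear independence of the three operators, which is tested through a Gram determinant.

```python
import cmath

w = cmath.exp(2j * cmath.pi / 3)  # primitive cube root of unity

def outer(u):
    """Rank-one operator |u><u| on C^2 as a nested list."""
    return [[u[r] * u[c].conjugate() for c in range(2)] for r in range(2)]

def hs_inner(a, b):
    """Hilbert-Schmidt inner product Tr(a^dagger b)."""
    return sum(a[r][c].conjugate() * b[r][c] for r in range(2) for c in range(2))

def det3(g):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
            - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
            + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))

# u_j = (1, conj(w^j)) / sqrt(3), so that M_j = |u_j><u_j| = (1/3) [[1, w^j], [conj(w^j), 1]].
vectors = [(complex(1) / 3 ** 0.5, (w ** j).conjugate() / 3 ** 0.5) for j in range(3)]
M = [outer(u) for u in vectors]

# The three operators add up to the identity, so they form a measurement.
total = [[sum(m[r][c] for m in M) for c in range(2)] for r in range(2)]

# Linear independence of M_0, M_1, M_2 is equivalent to a nonzero Gram determinant.
gram = [[hs_inner(a, b) for b in M] for a in M]
gram_det = det3(gram).real

# A projection would satisfy Tr M_j^2 = Tr M_j; here Tr M_j^2 = 4/9 while Tr M_j = 2/3.
is_von_neumann = all(abs(hs_inner(m, m) - (m[0][0] + m[1][1])) < 1e-12 for m in M)
```

The Gram determinant comes out as $2/27$, so the operators are linearly independent and the measurement is extremal, while none of the $M_j$ is a projection.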


Evolutions of quantum states in time due to inherent dynamics which 
may or may not be reversible and measurements made on the quantum 
system bring about changes in the state. Reversible changes are of the form
$\rho \to U \rho U^{-1}$ where $U$ is a unitary or antiunitary operator. According to the
theory proposed by Kraus ([11]) (see also [24] and [14]), the most relevant
transformations of states in a finite dimensional Hilbert space assume the
form

\[ T(\rho) = \frac{\sum_i L_i \rho L_i^{\dagger}}{\operatorname{Tr} \rho \sum_i L_i^{\dagger} L_i}, \qquad \rho \in S(\mathcal{H}) \]


where $\{L_1, L_2, \ldots\}$ is a finite set of operators in $\mathcal{H}$. Expressed in this form
$T$ is nonlinear in $\rho$ owing to the scalar factor in the denominator. If the
$L_i$'s satisfy the condition $\sum_i L_i^{\dagger} L_i = I$ then $T$ is linear and therefore $T$ defines
an affine map on the convex set $S(\mathcal{H})$. Indeed, $T$ is a trace preserving
and completely positive linear map on $\mathcal{B}(\mathcal{H})$ when extended in the obvious
manner. Such completely positive maps constitute a compact convex
set $\mathcal{F}(\mathcal{H})$ of linear maps on $\mathcal{B}(\mathcal{H})$. Elements of $\mathcal{F}(\mathcal{H})$ are the natural quantum
probabilistic analogues of stochastic matrices or transition probability operators of
Markov chains in classical probability theory. Note that F(H) is also a 
semigroup. One would like to have a detailed theory of F(H) both from 
the points of view of convexity structure as well as semigroup structure. 
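As a concrete instance of such a map, the following minimal sketch applies a qubit "bit flip" transformation with two operators $L_1 = \sqrt{1-p}\, I$ and $L_2 = \sqrt{p}\, X$ and checks the condition $\sum_i L_i^{\dagger} L_i = I$ together with trace preservation. The flip probability `p`, the input state `rho` and all function names are illustrative assumptions, not taken from the text.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def dagger(a):
    """Conjugate transpose."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def add(a, b):
    n = len(a)
    return [[a[i][j] + b[i][j] for j in range(n)] for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def apply_channel(rho, kraus_ops):
    """T(rho) = sum_i L_i rho L_i^dagger."""
    n = len(rho)
    out = [[0j] * n for _ in range(n)]
    for L in kraus_ops:
        out = add(out, matmul(matmul(L, rho), dagger(L)))
    return out

def completeness(kraus_ops):
    """sum_i L_i^dagger L_i, which equals I exactly when T preserves the trace."""
    n = len(kraus_ops[0])
    out = [[0j] * n for _ in range(n)]
    for L in kraus_ops:
        out = add(out, matmul(dagger(L), L))
    return out

p = 0.25                                          # hypothetical flip probability
L1 = [[(1 - p) ** 0.5, 0], [0, (1 - p) ** 0.5]]   # sqrt(1 - p) * I
L2 = [[0, p ** 0.5], [p ** 0.5, 0]]               # sqrt(p) * X
rho = [[0.75, 0.25j], [-0.25j, 0.25]]             # hermitian, positive, unit trace

out = apply_channel(rho, [L1, L2])
check = completeness([L1, L2])
```

Here `out` equals $(1-p)\rho + p X \rho X$, and `check` is the identity matrix up to rounding.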


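Theorem 10.2. Let $T \in \mathcal{F}(\mathcal{H})$ be of the form

\[ T(\rho) = \sum_{i=1}^{k} L_i \rho L_i^{\dagger}, \qquad \rho \in S(\mathcal{H}) \]

where $\sum_i L_i^{\dagger} L_i = I$. Then $T$ is an extreme point of $\mathcal{F}(\mathcal{H})$ if and only if
the family $\{L_i^{\dagger} L_j, 1 \le i \le k, 1 \le j \le k\}$ of operators is linearly independent
in $\mathcal{B}(\mathcal{H})$.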


Remark: For any $T \in \mathcal{F}(\mathcal{H})$ it is possible to find a linearly independent
set $\{L_i\}$ of operators in $\mathcal{B}(\mathcal{H})$ satisfying the relation $\sum_i L_i^{\dagger} L_i = I$ and

\[ T(\rho) = \sum_i L_i \rho L_i^{\dagger} \quad \forall\, \rho \in S(\mathcal{H}). \]

If $\{M_j\}$ is another finite set of linearly independent operators satisfying
these conditions then there exists a unitary matrix $((u_{ij}))$ satisfying

\[ M_i = \sum_j u_{ij} L_j \quad \forall\, i. \]

Given two elements $T_1, T_2 \in \mathcal{F}(\mathcal{H})$ it will be of interest to know when they
are equivalent modulo unitary conjugations.

As far as the semigroup $\mathcal{F}(\mathcal{H})$ is concerned, we make the following observation.
If $U$ is a unitary operator then the map $\rho \to U \rho U^{\dagger}$ is an invertible
element of $\mathcal{F}(\mathcal{H})$. If $P$ is a projection then the map $\rho \to P \rho P + (1 - P) \rho (1 - P)$
is an irreversible element of $\mathcal{F}(\mathcal{H})$. Both of these maps preserve not only
the trace but also the identity. Such maps are called bistochastic and they
constitute a subsemigroup $\mathcal{F}_b(\mathcal{H})$. Do these elementary bistochastic maps
described above generate $\mathcal{F}_b(\mathcal{H})$?
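Theorem 10.3. Let $T \in \mathcal{F}(\mathcal{H})$. Then there exists an ancillary Hilbert
space $h$, a unitary operator $U$ in $\mathcal{H} \otimes h$ and a pure state $|\Omega\rangle\langle\Omega|$ in $h$ with
the property

\[ T(\rho) = \operatorname{Tr}_h U (\rho \otimes |\Omega\rangle\langle\Omega|) U^{\dagger} \quad \forall\, \rho \in S(\mathcal{H}). \]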


Remark: Theorem 10.3 has the interpretation that the irreversible dynam- 
ics described by the trace-preserving completely positive map can always 
be viewed as a ‘coarse-graining’ of a finer reversible unitary evolution in an 
enlarged quantum system. When the single $T$ is replaced by a continuous
one parameter semigroup $\{T_t, t \ge 0\}$ it is possible to construct a unitary
evolution $\{U_t\}$ in a larger Hilbert space which is described by quantum
stochastic differential equations. Such semigroups are analogues of classi- 
cal semigroups and their study leads to a rich theory of quantum Markov 
processes. (See [14], [18].) 


10.3. Quantum Error-Correcting Codes 


The mathematical theory of quantum error-correcting codes is based on the 
following assumptions: (1) Messages can be encoded as states of a finite 
level quantum system and transmitted through a quantum communication 
channel; (2) For a given input state of the channel the output state can 
differ from the input state owing to the presence of ‘noise’ in the channel. 
Repeated transmission of the same input state can result in different output 
states; (3) There exists a collection of ‘good’ states which, when transmit- 
ted through the noisy channel, lead to output states from which the input 
can be recovered with no error or a small margin of error. The main goal 
is to identify a reasonably large collection of such good states for a given 
model of the channel and construct the decoding or recovery procedure. 
This can be described in the following pictorial form: 


[Figure: input message → encoder → channel (subject to noise) → decoder → output message; the channel carries the input state $\rho$ to the output state $T(\rho)$.]


Fig. 10.1. Encoding, transmission and decoding. 
Denote by $\mathcal{H}$ the finite dimensional Hilbert space from which the input
and output states of the channel appear in the communication system. We
assume that there is a linear subspace $\mathcal{E} \subset \mathcal{B}(\mathcal{H})$, called the error space,
such that for any input state $\rho$ of the channel the output state $T(\rho)$ has
the form

\[ T(\rho) = \frac{\sum_j L_j \rho L_j^{\dagger}}{\operatorname{Tr} \rho \sum_j L_j^{\dagger} L_j}, \qquad L_j \in \mathcal{E} \ \ \forall\, j, \tag{10.3} \]

where the summations are over finite sets of indices. If the same state $\rho$
is transmitted repeatedly the corresponding ‘corrupting’ or ‘noise-creating’
operators $\{L_j\}$ can differ in different transmissions but they always constitute
some finite subset of $\mathcal{E}$. The $L_j$'s may depend on the input state $\rho$.

For any subspace $S \subset \mathcal{H}$ denote by $E(S)$ the projection onto $S$ and
by $S^{\perp}$ the orthogonal complement of $S$ in $\mathcal{H}$. A state $\rho$ is said to have its
support contained in the subspace $S$ if $\rho |u\rangle = 0$ $\forall\, u$ in $S^{\perp}$. This means
that we can choose an orthonormal basis $\{e_1, e_2, \ldots, e_n\}$ for $\mathcal{H}$ such that
the subset $\{e_1, e_2, \ldots, e_k\}$ is an orthonormal basis of $S$ and the matrix of $\rho$
in this basis has the form $\begin{pmatrix} \tilde{\rho} & 0 \\ 0 & 0 \end{pmatrix}$ where $\tilde{\rho}$ is a nonnegative matrix of order
$k$ and unit trace. To recover the input state $\rho$ from the output state $T(\rho)$
of the channel we look for a recovery operation $R$ of the form

\[ R(T(\rho)) = \sum_j M_j T(\rho) M_j^{\dagger} \]


where $\{M_j\}$ is a finite set of operators depending only on the error space
$\mathcal{E} \subset \mathcal{B}(\mathcal{H})$ describing the noise. Whatever be the output we apply the same
recovery operation $R$. The goal is to construct a reasonably large subspace
$\mathcal{C} \subset \mathcal{H}$ and a recovery operation $R$ satisfying the requirement

\[ R(T(\rho)) = \rho \]

for every state $\rho$ with support in $\mathcal{C}$ and every transformation $T$ of the
form (10.3). In such a case we say that the pair $(\mathcal{C}, R)$ is an $\mathcal{E}$-correcting
quantum code. For such a code denote by $E = E(\mathcal{C})$ the projection onto
$\mathcal{C}$. It is a theorem of Knill and Laflamme ([12], [21]) that

\[ E L^{\dagger} M E = \lambda(L^{\dagger} M) E \quad \forall\, L, M \in \mathcal{E} \tag{10.4} \]
where $\lambda(L^{\dagger} M)$ is a scalar depending on $L^{\dagger} M$. Conversely, if $E$ is any
projection in $\mathcal{H}$ satisfying (10.4) for all $L, M$ in a subspace $\mathcal{E}$ of $\mathcal{B}(\mathcal{H})$
then there exists a recovery operation $R$ such that the range $\mathcal{C}$ of $E$ and
$R$ constitute an $\mathcal{E}$-correcting quantum code. It is important to note that
equation (10.4) is sesquilinear in $L, M$ and hence it suffices to verify (10.4)
for $L, M$ varying over a basis of $\mathcal{E}$. In view of this property the
projection $E$ satisfying (10.4), or its range, may be called an $\mathcal{E}$-correcting
quantum code.
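Condition (10.4) can be verified by direct computation over a basis of the error space. The sketch below does this for an example that is not discussed in the text: the three-qubit repetition code spanned by $|000\rangle$ and $|111\rangle$, with the error space spanned by the identity and the three single-qubit bit flips. All names are ours, and all matrices are real, so the adjoint is the transpose.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transpose(a):
    """Adjoint of a real matrix."""
    return [list(row) for row in zip(*a)]

def flip(bit):
    """Bit flip X acting on the given qubit (0, 1 or 2) of a three-qubit register."""
    return [[1 if r == c ^ (1 << bit) else 0 for c in range(8)] for r in range(8)]

identity = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
errors = [identity, flip(0), flip(1), flip(2)]      # basis of the error space

# Projection E onto the code subspace spanned by |000> (index 0) and |111> (index 7).
E = [[1 if r == c and r in (0, 7) else 0 for c in range(8)] for r in range(8)]

def kl_scalar(L, M):
    """Return lambda with E L^dagger M E = lambda E, or None if no such scalar exists."""
    K = matmul(matmul(E, matmul(transpose(L), M)), E)
    lam = K[0][0]
    ok = all(K[r][c] == lam * E[r][c] for r in range(8) for c in range(8))
    return lam if ok else None

# Table of the scalars lambda(L^dagger M) over the chosen basis.
table = [[kl_scalar(L, M) for M in errors] for L in errors]

# A phase flip Z on qubit 0 violates (10.4): E Z E is not a multiple of E.
Z0 = [[(-1) ** (r & 1) if r == c else 0 for c in range(8)] for r in range(8)]
phase_check = kl_scalar(identity, Z0)
```

For this code `table` is the $4 \times 4$ identity matrix, so bit-flip errors are correctable, whereas `phase_check` is `None`: the same subspace does not protect against phase errors.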

In order to construct quantum error-correcting codes it is often useful to
identify $\mathcal{H}$ with $L^2(A)$ where $A$ is a finite abelian group of order equal to the
dimension of $\mathcal{H}$ and consider the so-called Weyl operators as an orthogonal
basis for the Hilbert space $\mathcal{B}(\mathcal{H})$. To this end we view $A$ as an additive
abelian group with null element 0 and addition operation +. Denote by
$|a\rangle$ the ket vector equal to the singleton indicator function $1_{\{a\}}$ for every
$a \in A$. Then $\{|a\rangle, a \in A\}$ is an orthonormal basis for $\mathcal{H} = L^2(A)$. Choose
and fix a symmetric nondegenerate bicharacter $\langle \cdot, \cdot \rangle$ on $A \times A$ satisfying

(i) $\langle x, y \rangle = \langle y, x \rangle$ and $|\langle x, y \rangle| = 1$ $\forall\, x, y \in A$;
(ii) $\langle x, y_1 + y_2 \rangle = \langle x, y_1 \rangle \langle x, y_2 \rangle$ $\forall\, x, y_1, y_2 \in A$;
(iii) $\langle x, y \rangle = 1$ $\forall\, y \in A$ if and only if $x = 0$.

Such a choice is always possible. It is then clear there exist unitary
operators $U_a, V_a$, $a \in A$ satisfying

\[ U_a |x\rangle = |x + a\rangle, \qquad V_a |x\rangle = \langle a, x \rangle |x\rangle \quad \forall\, x \in A. \]

They satisfy the following relations:

\[ U_a U_b = U_{a+b}, \qquad V_a V_b = V_{a+b}, \qquad V_b U_a = \langle b, a \rangle U_a V_b, \qquad a, b \in A. \]


These are analogous to the famous Weyl commutation relations of quantum
theory for the unitary groups $U_a = e^{-iap}$, $V_a = e^{-iaq}$, $a \in \mathbb{R}$ where
$p$ and $q$ are the momentum and position operators. Here the real line is
replaced by a finite abelian group alphabet $A$ with $U_a$ as a location and $V_a$
as a phase operator. We write

\[ W(a, b) = U_a V_b, \qquad (a, b) \in A \times A \]

and call them Weyl operators. They satisfy the relations

\[ W(a, b) W(a', b') = \langle b, a' \rangle W(a + a', b + b'), \]
\[ W(a, b) W(x, y) W(a, b)^{-1} = \langle b, x \rangle \overline{\langle a, y \rangle}\, W(x, y). \]
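The relations among the Weyl operators are easy to confirm numerically in a small case. The sketch below takes $A = \mathbb{Z}_3$ with the bicharacter $\langle x, y \rangle = \exp(2\pi i x y / 3)$, which is one admissible choice and an assumption of this example. It checks the product rule $W(a, b) W(a', b') = \langle b, a' \rangle W(a + a', b + b')$ for all pairs and computes the traces of all nine Weyl operators.

```python
import cmath

n = 3  # A = Z_n; the bicharacter below is one admissible choice

def bichar(x, y):
    """<x, y> = exp(2*pi*i*x*y/n), a symmetric nondegenerate bicharacter on Z_n."""
    return cmath.exp(2j * cmath.pi * ((x * y) % n) / n)

def U(a):
    """Location operator: U_a|x> = |x + a>."""
    return [[1 if r == (c + a) % n else 0 for c in range(n)] for r in range(n)]

def V(b):
    """Phase operator: V_b|x> = <b, x>|x>."""
    return [[bichar(b, r) if r == c else 0 for c in range(n)] for r in range(n)]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def W(a, b):
    """Weyl operator W(a, b) = U_a V_b."""
    return matmul(U(a % n), V(b % n))

def close(p, q, tol=1e-12):
    return all(abs(p[i][j] - q[i][j]) < tol for i in range(n) for j in range(n))

def scaled(c, p):
    return [[c * p[i][j] for j in range(n)] for i in range(n)]

pairs = [(a, b) for a in range(n) for b in range(n)]

# Product rule: W(a,b) W(a',b') = <b,a'> W(a+a', b+b') for every pair of labels.
product_rule = all(
    close(matmul(W(a, b), W(a2, b2)), scaled(bichar(b, a2), W(a + a2, b + b2)))
    for (a, b) in pairs for (a2, b2) in pairs)

# Traces of the Weyl operators, indexed by (a, b).
traces = {(a, b): sum(W(a, b)[i][i] for i in range(n)) for (a, b) in pairs}
```

Only `traces[(0, 0)]` is nonzero (it equals $\dim \mathcal{H} = 3$), which is the property that makes the suitably normalized Weyl operators an orthonormal basis of $\mathcal{B}(\mathcal{H})$.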
We write

\[ \langle\langle (a, b), (x, y) \rangle\rangle = \langle b, x \rangle \overline{\langle a, y \rangle} \quad \forall\, (a, b), (x, y) \text{ in } A \times A \]

and call $\langle\langle \cdot, \cdot \rangle\rangle$ the symplectic bicharacter on $A \times A$. Two Weyl operators
$W(a, b)$ and $W(x, y)$ commute with each other if and only if
$\langle\langle (a, b), (x, y) \rangle\rangle = 1$. If $S \subset A \times A$ is a subgroup and $\langle\langle (a, b), (a', b') \rangle\rangle = 1$
for all $(a, b), (a', b')$ in $S$ then it is a theorem that there exists a function
$\omega : S \to \mathbb{T}$, $\mathbb{T}$ denoting the multiplicative group of complex scalars of modulus
unity, such that the correspondence

\[ (a, b) \to \omega(a, b) W(a, b) \]


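is a unitary representation of $S$. If $\chi$ is a character on such a subgroup $S$
define the projection

\[ E_{\chi}(S) = \frac{1}{\# S} \sum_{(a, b) \in S} \overline{\chi(a, b)}\, \omega(a, b) W(a, b), \qquad \chi \in \hat{S}, \tag{10.5} \]

$\hat{S}$ denoting the character group of $S$. Subgroups $S \subset A \times A$ of the type
introduced above are called selforthogonal or Gottesman groups (see [3],
[1]). Denote

\[ S^{\perp} = \{ (x, y) \mid \langle\langle (x, y), (a, b) \rangle\rangle = 1 \ \ \forall\, (a, b) \in S \} \subset A \times A. \tag{10.6} \]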


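$S^{\perp}$ is a subgroup of $A \times A$ and $S^{\perp} \supset S$. It may be called the symplectic
annihilator of $S$. Note that

\[ \operatorname{Tr} W(a, b) = \begin{cases} 0 & \text{if } (a, b) \ne (0, 0) \\ \dim \mathcal{H} & \text{otherwise.} \end{cases} \]

This shows that $\{ (\dim \mathcal{H})^{-1/2} W(x, y), (x, y) \in A \times A \}$ is an orthonormal
basis for the Hilbert space $\mathcal{B}(\mathcal{H})$ introduced earlier. In particular, the Weyl
operators constitute an irreducible family and every operator $X$ on $\mathcal{H}$ admits
a ‘Fourier expansion’

\[ X = \frac{1}{\dim \mathcal{H}} \sum_{(a, b) \in A \times A} \{ \operatorname{Tr} W(a, b)^{\dagger} X \}\, W(a, b), \qquad X \in \mathcal{B}(\mathcal{H}). \tag{10.7} \]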


Elementary algebra using Schur orthogonality relations for characters
shows that

\[ E_{\chi}(S) W(x, y) E_{\chi}(S) = \begin{cases} 0 & \text{if } (x, y) \notin S^{\perp} \\ \lambda(x, y) E_{\chi}(S) & \text{if } (x, y) \in S, \end{cases} \]

where $\lambda(x, y)$ is a scalar. In other words the linear space

\[ D(S) = \text{lin span} \{ W(x, y), (x, y) \in S \cup (A \times A \setminus S^{\perp}) \} \tag{10.8} \]

satisfies the property

\[ E_{\chi}(S) L E_{\chi}(S) = \lambda(L) E_{\chi}(S) \quad \forall\, L \in D(S). \]


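Clearly, $D(S)$ is closed under the adjoint operation. Suppose now that
$\mathcal{E} \subset \mathcal{B}(\mathcal{H})$ is a subspace satisfying the property

\[ \mathcal{E}^{\dagger} \mathcal{E} = \{ L^{\dagger} M \mid L, M \in \mathcal{E} \} \subset D(S). \]

Then (10.4) holds when $E = E_{\chi}(S)$ and by the Knill-Laflamme theorem
the range of $E_{\chi}(S)$ is an $\mathcal{E}$-correcting quantum code for any Gottesman
subgroup $S$ and character $\chi$ of $S$. Furthermore,

\[ E_{\chi}(S) E_{\chi'}(S) = 0 \ \text{ if } \chi, \chi' \in \hat{S} \text{ and } \chi \ne \chi', \qquad \sum_{\chi \in \hat{S}} E_{\chi}(S) = I. \]

If $\mathcal{C}_{\chi}(S)$ denotes the range of $E_{\chi}(S)$ it follows that $\mathcal{H}$ decomposes into a
direct sum of $\mathcal{E}$-correcting or $D(S)$-detecting quantum codes of dimension
$\dim \mathcal{H} / \# S$.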

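We now discuss the special case when $\mathcal{H}$ is replaced by its $n$-fold tensor
product $\mathcal{H}^{\otimes n} = L^2(A)^{\otimes n} = L^2(A^n)$. Denoting any point $x \in A^n$ by $x =
(x_1, x_2, \ldots, x_n)$ where $x_j \in A$ is the $j$-th coordinate we get a symmetric
nondegenerate bicharacter on $A^n \times A^n$ by putting

\[ \langle x, y \rangle = \prod_{j=1}^{n} \langle x_j, y_j \rangle \]

and the corresponding Weyl operators

\[ W(x, y) = W(x_1, y_1) \otimes W(x_2, y_2) \otimes \cdots \otimes W(x_n, y_n) \]

where $W(x_j, y_j)$ is the Weyl operator in $L^2(A)$ associated with $(x_j, y_j)$ for
each $j$. Let $\mathcal{E}_d$ denote the linear span of all operators in $\mathcal{H}^{\otimes n}$ of the form
$X_1 \otimes X_2 \otimes \cdots \otimes X_n$, where $X_i$ is different from $I$ for at most $d$ values of $i$.
Then $\mathcal{E}_d$ has the orthogonal unitary operator basis

\[ \{ W(a, b) \mid \# \{ i \mid (a_i, b_i) \ne (0, 0) \} \le d \}. \]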
One says that the element $(a, b)$ in $A^n \times A^n$ has weight

\[ w(a, b) = \# \{ i \mid (a_i, b_i) \ne (0, 0) \}. \]

Let $S \subset A^n \times A^n$ be a Gottesman subgroup of $A^n \times A^n$ such that
every element $(a, b)$ in $S^{\perp} \setminus S$ has weight $> d$. Then it follows from the
preceding discussions and the definition in (10.8) that $\mathcal{C}_{\chi}(S)$, i.e., the range
of $E_{\chi}(S)$, is an $\mathcal{E}_t$-correcting quantum code where $t = \lfloor d/2 \rfloor$. It is also
called an $\mathcal{E}_d$-detecting quantum code. Thus the problem of constructing
$\mathcal{E}_d$-detecting quantum codes reduces to the algebraic problem of constructing
symplectic self-orthogonal or Gottesman subgroups $S$ of $A^n \times A^n$ satisfying
the property that every element $(a, b)$ in $S^{\perp} \setminus S$ has weight $> d$.

Choosing $A = \mathbb{Z}_2$, the additive group of the field GF(2), the problem
of constructing symplectic selforthogonal subgroups of $A^n \times A^n$ has been
reduced to the construction of classical error correcting codes over GF(4)
by [3]. See also [1], [21].

We conclude this section with an example of a quantum code based
on a Gottesman subgroup which also yields states exhibiting maximal
entanglement for multipartite quantum systems. To this end, we introduce
in $L^2(A^5)$ a single error correcting (i.e., $\mathcal{E}_1$-correcting) quantum code as
follows. Introduce the cyclic permutation $\sigma$ in $A^5$ by

\[ \sigma(x_0, x_1, x_2, x_3, x_4) = (x_1, x_2, x_3, x_4, x_0) \]

and put

\[ \tau(x) = \sigma^2(x) + \sigma^{-2}(x), \qquad x = (x_0, x_1, x_2, x_3, x_4) \in A^5, \]

\[ W_x = \langle x, \sigma^2(x) \rangle W(x, \tau(x)), \qquad x \in A^5. \]

Consider the subgroup $C \subset A^5$ given by

\[ C = \{ x \mid x_0 + x_1 + x_2 + x_3 + x_4 = 0 \}. \]

Then $C$ is a Gottesman subgroup and $x \to W_x$ is a unitary representation
of $C$ and the operator

\[ E(C) = (\# A)^{-4} \sum_{x \in C} W_x \]

is a projection on the subspace

\[ \mathcal{C} = \{ \psi \mid W_x \psi = \psi \ \ \forall\, x \in C \}. \]
This subspace is an $\mathcal{E}_1$-correcting quantum code of dimension $\# A$. It
has the maximum entanglement property in the sense that for any $\psi \in \mathcal{C}$
the pure state $|\psi\rangle\langle\psi|$ satisfies the relation

\[ \operatorname{Tr}_{L^2(A^3)} |\psi\rangle\langle\psi| = \frac{I_{L^2(A^2)}}{(\# A)^2} \]

for any three copies of $A$ occurring in $A^5$. Here we view $L^2(A^5)$ as $L^2(A) \otimes
\cdots \otimes L^2(A)$ with 5 copies and the factorization

\[ L^2(A^5) = L^2(A^3) \otimes L^2(A^2). \]


For more details see [19], [22]. It is an interesting problem to construct 
such examples for products of n copies of A with n > 5 and construct 
subspaces of the form C with maximal dimension. It is desirable to have a 
formula for this maximal dimension. 
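The weight condition behind the five-fold example can be confirmed by brute force in a small case. The sketch below assumes $A = \mathbb{Z}_3$ with $\langle x, y \rangle = \exp(2\pi i x y/3)$, takes $S = \{ (x, \tau(x)) : x \in C \}$, and represents the symplectic bicharacter by its exponent modulo 3; these choices and all function names are ours. Because the bicharacter is multiplicative in each argument, membership in the annihilator $S^{\perp}$ only needs to be tested against generators of $S$.

```python
from itertools import product

n = 3   # alphabet A = Z_3 (an illustrative choice)
N = 5   # number of copies of A

def tau(x):
    """tau(x) = sigma^2(x) + sigma^{-2}(x): coordinate i receives x_{i+2} + x_{i-2}."""
    return tuple((x[(i + 2) % N] + x[(i - 2) % N]) % n for i in range(N))

def symp_exponent(p, q):
    """Exponent e with <<(a,b),(x,y)>> = exp(2*pi*i*e/n), i.e. e = b.x - a.y mod n."""
    (a, b), (x, y) = p, q
    return (sum(bi * xi for bi, xi in zip(b, x))
            - sum(ai * yi for ai, yi in zip(a, y))) % n

def weight(p):
    """Number of coordinates i with (a_i, b_i) != (0, 0)."""
    a, b = p
    return sum(1 for ai, bi in zip(a, b) if (ai, bi) != (0, 0))

C = [x for x in product(range(n), repeat=N) if sum(x) % n == 0]
S = {(x, tau(x)) for x in C}

# Self-orthogonality: the symplectic bicharacter is trivial on S x S.
self_orthogonal = all(symp_exponent(p, q) == 0 for p in S for q in S)

# Generators of S coming from the generators e_i - e_{i+1} of C.
gens = []
for i in range(N - 1):
    x = [0] * N
    x[i], x[i + 1] = 1, n - 1
    gens.append((tuple(x), tau(tuple(x))))

# Symplectic annihilator of S, found by exhaustive search over A^5 x A^5.
S_perp = [(a, b)
          for a in product(range(n), repeat=N)
          for b in product(range(n), repeat=N)
          if all(symp_exponent((a, b), g) == 0 for g in gens)]

min_weight = min(weight(p) for p in S_perp if p not in S)
code_dimension = n ** N // len(S)
```

In this case every element of $S^{\perp} \setminus S$ has weight at least 3, so the weight exceeds $d = 2$ and $t = \lfloor d/2 \rfloor = 1$, in agreement with the single error correcting property, and the code dimension $\# A^5 / \# S$ equals $\# A = 3$.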


10.4. Testing Quantum Hypotheses 


Suppose a quantum system with Hilbert space $\mathcal{H}$ is known to be in one of
the states $\{\rho_1, \rho_2, \ldots, \rho_k\}$ and we have to decide the true state by making a
measurement $M = \begin{pmatrix} 1 & 2 & \cdots & k \\ M_1 & M_2 & \cdots & M_k \end{pmatrix}$ as described in (10.2) with the value
set $\{1, 2, \ldots, k\}$ and applying the decision rule that the state of the system
is $\rho_j$ if the outcome of the measurement is $j$. In such a case, if the true
state is $\rho_i$ then the probability of deciding $\rho_j$ is equal to $\operatorname{Tr} \rho_i M_j$. Suppose
there is a prior distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_k)$ on the set of states, i.e.,
the system is in the state $\rho_i$ with probability $\pi_i$ for each $i$ and there is a cost
matrix $((c_{ij}))$ according to which $c_{ij}$ is the cost of deciding $\rho_j$ when the true
state is $\rho_i$. In such a case the expected cost of the decision rule associated
with the measurement is

\[ C(M) = \sum_{i, j} \pi_i (\operatorname{Tr} \rho_i M_j) c_{ij}. \]

The natural thing to do is to choose a measurement $M$ which minimizes
the cost $C(M)$. Define

\[ \gamma = \inf_{M} C(M) \]

and the hermitian operators

\[ A_j = \sum_{i} \pi_i c_{ij} \rho_i. \tag{10.9} \]
Then

\[ \gamma = \inf_{M} \operatorname{Tr} \sum_{j=1}^{k} A_j M_j. \tag{10.10} \]

Since the space of all measurements $M$ with value set $\{1, 2, \ldots, k\}$ constitutes
the compact convex set $\mathcal{M}_k(\mathcal{H})$, $\gamma$ is attained at a measurement $M^0$
which is also an extreme point of $\mathcal{M}_k(\mathcal{H})$. An appropriate application of
Lagrange multipliers shows that the following theorem holds.


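Theorem 10.4. (cf. [26], [9], [16]) Let $M^0 = \begin{pmatrix} 1 & 2 & \cdots & k \\ M_1^0 & M_2^0 & \cdots & M_k^0 \end{pmatrix}$ be a
measurement satisfying

\[ \gamma = \operatorname{Tr} \sum_{j=1}^{k} A_j M_j^0 \]

where $A_j$ and $\gamma$ are given by (10.9) and (10.10) and let $\Gamma = \sum_{j=1}^{k} A_j M_j^0$.
Then the following holds:

(i) $\Gamma$ is hermitian and $\Gamma \le A_j$ $\forall\, j = 1, 2, \ldots, k$;
(ii) $(A_j - \Gamma) M_j^0 = 0$ $\forall\, j$;
(iii) $\Gamma$ is independent of the measurement $M^0$ at which $\gamma$ is attained.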


Remark: In the context of Theorem 10.4 it is natural to seek a good
algorithm for evaluating $\Gamma$ when the hermitian operators $\{A_j, 1 \le j \le k\}$
are known. One may interpret $\Gamma$ as a ‘minimum’ of the noncommuting
observables $A_j$, $j = 1, 2, \ldots, k$.

In the discussion preceding Theorem 10.4 note that $\sum_i \pi_i \operatorname{Tr} \rho_i M_i$ is
the probability of correct decision. It is natural to consider the problem
of maximising this probability in the spirit of the method of maximum
likelihood. Let

\[ \delta = \max_{M} \operatorname{Tr} \sum_{i=1}^{k} \pi_i \rho_i M_i. \]

As before $\delta$ is attained at some measurement $M^0$. Then one has the
following theorem.

Theorem 10.5. Let $M^0 = \begin{pmatrix} 1 & 2 & \cdots & k \\ M_1^0 & M_2^0 & \cdots & M_k^0 \end{pmatrix}$ be a measurement
satisfying

\[ \delta = \operatorname{Tr} \sum_{i} \pi_i \rho_i M_i^0 \]




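and let $\Lambda = \sum_i \pi_i \rho_i M_i^0$. Then the following holds:

(i) $\Lambda$ is hermitian and $\Lambda \ge \pi_i \rho_i$ $\forall\, i$;
(ii) $(\Lambda - \pi_i \rho_i) M_i^0 = 0$ $\forall\, i$;
(iii) $\Lambda$ is independent of $M^0$ at which $\delta$ is attained.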


Now consider the situation when $n$ copies of the system are available so
that $\rho_i$ can be replaced by $\rho_i^{\otimes n}$ for each $i$ in $\mathcal{H}^{\otimes n}$. Write

\[ \delta_n = \max \operatorname{Tr} \sum_{i=1}^{k} \pi_i \rho_i^{\otimes n} M_i \]

where $M \in \mathcal{M}_k(\mathcal{H}^{\otimes n})$ and $\Lambda_n$ for the corresponding operator in Theorem 10.5.
Then the probability of making a wrong decision is equal to $1 - \delta_n$.
By the elementary arguments in [17] it follows that $1 - \delta_n$ decreases to zero
as $n$ increases to $\infty$ and the rate of decrease is exponential. It is an open
problem to determine the quantities

\[ \liminf_{n \to \infty} \frac{-\log(1 - \delta_n)}{n} \quad \text{and} \quad \limsup_{n \to \infty} \frac{-\log(1 - \delta_n)}{n}. \]

When $k = 2$ so that there are only two states the following theorem holds:


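Theorem 10.6. (Quantum Chernoff Bound, see [2], [13]) Let $\rho_1, \rho_2$ be
two states and let $\pi_1, \pi_2$ be a prior probability distribution on $\{1, 2\}$ with
$0 < \pi_1 < 1$. Suppose

\[ \delta_n = \max \operatorname{Tr} \left( \pi_1 \rho_1^{\otimes n} M_1 + \pi_2 \rho_2^{\otimes n} M_2 \right) \]

where

\[ M = \begin{pmatrix} 1 & 2 \\ M_1 & M_2 \end{pmatrix} \]

varies over all measurements in $\mathcal{H}^{\otimes n}$ with value set $\{1, 2\}$. Then

\[ \lim_{n \to \infty} -\frac{1}{n} \log(1 - \delta_n) = \sup_{0 \le s \le 1} -\log \operatorname{Tr} \rho_1^{s} \rho_2^{1-s}. \]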


Remark: An explicit computation shows that in Theorem 10.6

\[ \delta_n = \frac{1}{2} \left( 1 + \left\| \pi_1 \rho_1^{\otimes n} - \pi_2 \rho_2^{\otimes n} \right\|_1 \right) \]

where $\| \cdot \|_1$ stands for trace norm.
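For a single copy ($n = 1$) of a qubit the formula in the remark can be evaluated in closed form, since the trace norm of a $2 \times 2$ hermitian matrix is the sum of the absolute values of its two eigenvalues. The sketch below assumes the two states $\rho_1 = |0\rangle\langle 0|$ and $\rho_2 = |{+}\rangle\langle{+}|$ with equal priors; these states and the function names are illustrative choices, not taken from the text.

```python
import math

def hermitian_eigenvalues(m):
    """Eigenvalues of a 2x2 hermitian matrix [[a, b], [conj(b), d]]."""
    a, d = m[0][0].real, m[1][1].real
    b = m[0][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    return mean - radius, mean + radius

def trace_norm(m):
    """Trace norm of a 2x2 hermitian matrix."""
    return sum(abs(ev) for ev in hermitian_eigenvalues(m))

def optimal_success(pi1, rho1, pi2, rho2):
    """delta = (1 + ||pi1 rho1 - pi2 rho2||_1) / 2 for a single copy."""
    diff = [[pi1 * rho1[r][c] - pi2 * rho2[r][c] for c in range(2)] for r in range(2)]
    return 0.5 * (1 + trace_norm(diff))

rho1 = [[1.0, 0.0], [0.0, 0.0]]   # |0><0|
rho2 = [[0.5, 0.5], [0.5, 0.5]]   # |+><+|
delta = optimal_success(0.5, rho1, 0.5, rho2)

# A naive rule for comparison: measure in the basis {|0>, |1>} and decide rho1 on outcome 0.
naive = 0.5 * rho1[0][0] + 0.5 * rho2[1][1]
```

The optimal value is $\delta = \frac{1}{2}(1 + 1/\sqrt{2}) \approx 0.854$, compared with $0.75$ for the naive rule.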
In the two states case we introduce the quantity

\[ \beta(\rho_1, \rho_2, \varepsilon) = \inf \{ \operatorname{Tr} \rho_2 X \mid 0 \le X \le I, \ \operatorname{Tr} \rho_1 X \ge 1 - \varepsilon \} \]

for any $0 < \varepsilon < 1$ and states $\rho_1, \rho_2$. It has the following statistical interpretation.
For any $0 \le X \le I$ consider the measurement

\[ M = \begin{pmatrix} 1 & 2 \\ X & I - X \end{pmatrix} \]

with value set $\{1, 2\}$. For the decision rule based on $M$, $\operatorname{Tr} \rho_1 X$ is the probability
of correct decision in the state $\rho_1$. We vary $M$ so that this probability
is kept above $1 - \varepsilon$. Now we look at the probability of a wrong decision in
the state $\rho_2$ and note that the minimum possible value for this probability
is $\beta(\rho_1, \rho_2, \varepsilon)$. In the Neyman-Pearson theory of testing hypotheses the
quantity $1 - \beta(\rho_1, \rho_2, \varepsilon)$ is called the maximum possible power at a level of
significance above $1 - \varepsilon$. Now we state a fundamental theorem concerning
$\beta(\rho_1^{\otimes n}, \rho_2^{\otimes n}, \varepsilon)$ due to [8].


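Theorem 10.7. (Quantum Stein’s Lemma, see [6])

\[ \lim_{n \to \infty} -\frac{\log \beta(\rho_1^{\otimes n}, \rho_2^{\otimes n}, \varepsilon)}{n} = \operatorname{Tr} \rho_1 (\log \rho_1 - \log \rho_2). \]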


Remark: The right hand side of the relation in Theorem 10.7 is known as
the relative entropy of $\rho_2$ with respect to $\rho_1$ or Kullback-Leibler divergence
of $\rho_2$ from $\rho_1$ and denoted as $S(\rho_1 \| \rho_2)$.
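When $\rho_1$ and $\rho_2$ commute they can be diagonalized simultaneously and $\operatorname{Tr} \rho_1 (\log \rho_1 - \log \rho_2)$ reduces to the classical Kullback-Leibler sum over their eigenvalues. The following minimal sketch covers only this commuting case, with natural logarithms and made-up eigenvalue lists; the function name is ours.

```python
import math

def relative_entropy(p, q):
    """S(rho1 || rho2) for commuting states with eigenvalue lists p and q (natural log).

    Terms with p_i = 0 contribute nothing; q_i is assumed positive wherever p_i > 0.
    """
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]     # eigenvalues of rho1 (illustrative)
q = [0.25, 0.75]   # eigenvalues of rho2 in the same eigenbasis (illustrative)

d_pq = relative_entropy(p, q)   # S(rho1 || rho2)
d_qp = relative_entropy(q, p)   # S(rho2 || rho1), in general a different number
```

Here $S(\rho_1 \| \rho_2) = \frac{1}{2} \log \frac{4}{3} \approx 0.144$ while $S(\rho_2 \| \rho_1) \approx 0.131$, which illustrates that the relative entropy is not symmetric in its two arguments.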

We shall conclude with a brief description of a quantum version of Shannon’s
coding theorem for classical-quantum communication channels. A
classical-quantum or, simply, a cq-channel $\mathcal{C}$ consists of a finite set, called
the input alphabet $A$, and a set $\{\rho_x, x \in A\}$ of states in a Hilbert space $\mathcal{H}$.
A quantum code of size $m$ and error probability $\le \varepsilon$ consists of a subset
$C \subset A$ of cardinality $m$ and a $C$-valued measurement $M(C) = \{M_x, x \in C\}$
with the property $\operatorname{Tr} \rho_x M_x \ge 1 - \varepsilon$ $\forall\, x \in C$. If we have such a code then $m$
classical messages can be encoded as states $\{\rho_x, x \in C\}$ and the measurement
$M(C)$ yields a decoding rule: if $x \in C$ is the value yielded by the
measurement $M(C)$ decide that the message corresponding to the state $\rho_x$
was transmitted. Then the probability of a wrong decoding does not exceed
$\varepsilon$. In view of this useful property the following quantity

\[ N(\mathcal{C}, \varepsilon) = \max \{ m \mid \text{a quantum code of size } m \text{ and error probability} \le \varepsilon \text{ exists} \} \]

is an important parameter concerning the cq-channel $\mathcal{C}$.
If $\mathcal{C}$ is a cq-channel with input alphabet $A$ and states $\{\rho_x, x \in A\}$ in $\mathcal{H}$
we can define the $n$-fold product $\mathcal{C}^{\otimes n}$ of $\mathcal{C}$ by choosing the alphabet $A^n$ and
the states

\[ \{ \rho_{x_1} \otimes \rho_{x_2} \otimes \cdots \otimes \rho_{x_n} \mid x = (x_1, x_2, \ldots, x_n) \in A^n, \ x_i \in A \ \ \forall\, i \}. \]

Define

\[ N(n, \varepsilon) = N(\mathcal{C}^{\otimes n}, \varepsilon). \]

Then we have the following quantum version of Shannon’s coding
theorem.

Theorem 10.8. (Shannon’s Coding Theorem, see [6], [23], [25])

\[ \lim_{n \to \infty} \frac{1}{n} \log N(n, \varepsilon) = \sup_{p(\cdot)} \left\{ S\Big( \sum_{x \in A} p(x) \rho_x \Big) - \sum_{x \in A} p(x) S(\rho_x) \right\} \tag{10.11} \]

where the supremum on the right hand side is taken over all probability
distributions $\{p(x), x \in A\}$ and $S(\rho)$ stands for the von Neumann entropy
of the state $\rho$.


Remark: The quantity on the right hand side of (10.11) is usually denoted
by $C = C(\mathcal{C})$, called the Shannon-Holevo capacity of the cq-channel $\mathcal{C}$.
It is interesting to note that the expression within the supremum on the
right hand side of (10.11) can be expressed as the mean relative entropy
$\sum_{x \in A} p(x) S(\rho_x \| \bar{\rho})$ where $\bar{\rho}$ denotes the state $\sum_x p(x) \rho_x$.
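The supremum in (10.11) can be approximated by a grid search in a small case. The sketch below assumes a two-letter alphabet with the pure qubit states $|0\rangle\langle 0|$ and $|{+}\rangle\langle{+}|$ (an illustrative channel, not one from the text), measures entropy in bits, and uses the closed form for the eigenvalues of a real symmetric $2 \times 2$ matrix; all names are ours.

```python
import math

def eigenvalues_2x2(m):
    """Eigenvalues of a real symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean - radius, mean + radius

def entropy_bits(m):
    """von Neumann entropy -Tr rho log2 rho, ignoring zero eigenvalues."""
    return -sum(ev * math.log2(ev) for ev in eigenvalues_2x2(m) if ev > 1e-15)

def holevo_quantity(p, rho_a, rho_b):
    """S(p rho_a + (1-p) rho_b) - p S(rho_a) - (1-p) S(rho_b)."""
    mix = [[p * rho_a[r][c] + (1 - p) * rho_b[r][c] for c in range(2)] for r in range(2)]
    return entropy_bits(mix) - p * entropy_bits(rho_a) - (1 - p) * entropy_bits(rho_b)

rho_a = [[1.0, 0.0], [0.0, 0.0]]   # |0><0|
rho_b = [[0.5, 0.5], [0.5, 0.5]]   # |+><+|

grid = [i / 100 for i in range(101)]           # candidate values of p(a)
values = [holevo_quantity(p, rho_a, rho_b) for p in grid]
capacity_estimate = max(values)
best_p = grid[values.index(capacity_estimate)]
```

For this channel the maximum is attained at the uniform distribution and is about $0.601$ bits, less than the one bit that two orthogonal states would give.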
Scaling Limits
11.1. Introduction


The simplest example of a scaling limit is to take a sequence $\{X_i\}$ of
independent random variables, with each $X_i$ taking values $\{+1, -1\}$ with
probability $\{p, 1 - p\}$ respectively. Let

\[ S_k = X_1 + X_2 + \cdots + X_k. \]

If $m = 2p - 1$, by the law of large numbers

\[ \frac{1}{n} S_{[nt]} \to mt. \]

Here the rescaling is $x = \frac{S}{n}$ and $t = \frac{k}{n}$ and the limit is non-random. If
$m = 0$, i.e. $p = \frac{1}{2}$, the convergence of the random walk, with a different
scaling, is to Brownian motion. According to the invariance principle of
Donsker ([1]), the distribution of

\[ X_n(t) = \frac{1}{\sqrt{n}} S_{[nt]} \]

converges to that of Brownian motion. Here the rescaling is different, $x =
\frac{S}{\sqrt{n}}$ rescales space and $t = \frac{k}{n}$ rescales time. In the new time scale the exact
position of the particle in the microscopic location is not tracked. Only
the motion of its macroscopic location $x(t)$ which changes in a reasonable
manner in the macroscopic time scale with a non trivial random motion is
tracked. This random motion is the Brownian motion.
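The two scalings can be seen in a short simulation. The sketch below is only illustrative: the seed, sample sizes and tolerances are arbitrary choices of ours. It checks that $S_n / n$ is close to $m = 2p - 1$ for a biased walk, and that $S_n / \sqrt{n}$ has second moment close to 1 for the symmetric walk, as expected for a Brownian limit at time 1.

```python
import random

def walk_endpoint(steps, p, rng):
    """S_steps for a walk with independent +1/-1 increments, P(+1) = p."""
    return sum(1 if rng.random() < p else -1 for _ in range(steps))

rng = random.Random(0)   # fixed seed so that the run is reproducible

# Law of large numbers scaling: S_n / n should be near m = 2p - 1.
n = 20000
p = 0.7
lln_value = walk_endpoint(n, p, rng) / n

# Central limit (Brownian) scaling with p = 1/2: S_n / sqrt(n) has variance 1.
steps, trials = 400, 2000
samples = [walk_endpoint(steps, 0.5, rng) / steps ** 0.5 for _ in range(trials)]
second_moment = sum(s * s for s in samples) / trials
```

With these sizes `lln_value` fluctuates around $0.4$ on a scale of about $0.007$, while `second_moment` fluctuates around 1 on a scale of a few percent.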
Many problems in the physical sciences involve modeling a large system
at a small scale while observations and predictions are made at a much 
larger scale. Of particular interest are dynamical models involving two 
distinct scales, like the random walk model mentioned earlier, a microscopic 
one and a macroscopic one. The system and its evolution are specified at 
the microscopic level. For instance the system could be particles in space 
that interact with each other and move according to specific laws, that can 
be either deterministic or random. The system can have many conserved 
quantities. For instance in the classical case of Hamiltonian dynamics, mass, 
momentum and energy are conserved in the interaction. In the simplest case 
of interacting particle systems, in the absence of creation or annihilation 
the number of particles is always conserved. 


Usually there are invariant probability measures for the complex system, 
which may not be unique due to the presence of conserved quantities. They 
represent the statistics of the system in a steady state. If the evolution is 
random and is prescribed by a Markov process, then these are the invariant 
measures. If we fix the size of the system there will be a family of invariant 
measures corresponding to the different values of the conserved quantities. 
When the size of the system becomes infinite, there will be limiting mea- 
sures, on infinite systems and a family of invariant probability measures 
indexed by the average values of the conserved quantities. In the theory of 
equilibrium statistical mechanics, these topics are explored in detail ([2]). 


11.2. Non-Interacting Case 


Let us illustrate this by the most elementary of non interacting random
walks. We have a large number $n$ of independent random walks, with
$p = \frac{1}{2}$. The walks $\{S_k^j : 1 \le j \le n\}$ start from locations $\{z_j\}$ at time $k = 0$.
Their locations at time $k$ are $S_k^j$. We assume that initially there exists a
function $\rho_0(x) \ge 0$ on $\mathbb{R}$ with $\int \rho_0(x)\,dx = \bar{\rho} < \infty$, such that

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{j} f\Big( \frac{z_j}{n} \Big) = \int f(x) \rho_0(x)\,dx \tag{2.1} \]


for every bounded continuous function $f$ on $\mathbb{R}$. If we look at it from a
distance we do not see the individual particles, but only a cloud, and the
density varies over the location $x$ in the macroscopic scale. The statistical
nature of the actual locations can be quite general. We could place them
deterministically in some systematic way to achieve the requisite density
profile, or place them randomly with the number of particles at site $z$ having
a Poisson distribution with expectation

\[ p_n(z) = n \int_{\frac{2z-1}{2n}}^{\frac{2z+1}{2n}} \rho_0(x)\,dx \]

and independently so for different sites. The relation (2.1) is now a consequence
of the ergodic theorem. If $\eta(0, z)$ is the number of particles initially
at site $z$, then (2.1) can be rewritten as

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{z} f\Big( \frac{z}{n} \Big) \eta(0, z) = \int f(x) \rho_0(x)\,dx. \tag{2.2} \]


If the initial distributions of the number of particles at different sites are Poisson
with the same constant parameter $\lambda$, and they are independent, then this
distribution is preserved and the same holds at any future time. For any
constant $\lambda$ the product measure $P_{\lambda}$ of independent Poissons at different sites
with the same parameter $\lambda$ is invariant. But this corresponds to $\rho_0(x) \equiv \lambda$.
If the system evolves from an arbitrary initial state and if we look at
it at time $n^2 t$, since the rescaling is with $n$ in space and $n^2$ in time, we are
working with the Brownian or central limit scaling and

\[ P\big[ S^j_{n^2 t} = z \big] \simeq \frac{1}{n \sqrt{2 \pi t}} \exp\Big( -\frac{(z - z_j)^2}{2 n^2 t} \Big) \]

and the combination of a local limit theorem and the Poisson limit theorem
implies that the distribution of $\eta(n^2 t, z)$, in the limit, as $\frac{z}{n} \to x$, is Poisson
with parameter

\[ \rho(t, x) = \lim_{n \to \infty} \frac{1}{n} \sum_{z'} \frac{1}{\sqrt{2 \pi t}} e^{-\frac{(z - z')^2}{2 n^2 t}} \eta(0, z') = \int \frac{1}{\sqrt{2 \pi t}} e^{-\frac{(y - x)^2}{2t}} \rho_0(y)\,dy. \]


In other words we see locally an invariant distribution, but with a parameter
that changes with macroscopic location $(t, x)$ in space and time. If we look
at the picture through a microscope, at the grain level we see the Poisson
distribution, with a parameter determined by the location. It is not hard
to prove a law of large numbers,

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{z} f\Big( \frac{z}{n} \Big) \eta(n^2 t, z) = \int f(x) \rho(t, x)\,dx. \tag{2.3} \]


These are statements of convergence in probability. Notice that the only
conserved quantity here is the number of particles. Therefore one expects
the quantity of interest to be the density $\rho(t, x)$. Once we know $\rho(t, x)$ we
are aware that locally, microscopically, it looks like a Poisson process on the
integer lattice with constant intensity $\rho(t, x)$. While in this case we were
able to solve explicitly, we should note that $\rho(t, x)$ is the solution of the
PDE

\[ \frac{\partial \rho}{\partial t} = \frac{1}{2} \frac{\partial^2 \rho}{\partial x^2}, \qquad \rho(0, x) = \rho_0(x). \tag{2.4} \]
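It is more convenient if we make time continuous and have the particles
execute jumps according to Poisson random times with mean 1. The jump rate
is $\frac{1}{2}$ for each direction, left or right, for each particle. Speeding up time now
just changes the rates to $\frac{n^2}{2}$. If we think of the whole system as undergoing
an infinite dimensional Markov process, then the generator $\mathcal{L}_n$ acting on a
function $F(\eta)$ of the configuration $\{\eta(z)\}$ of particle numbers at different
sites is given by

\[ (\mathcal{L}_n F)(\eta) = \sum_{\eta'} [F(\eta') - F(\eta)]\, \sigma(\eta, \eta'). \]

The new configurations $\eta'$ involved are the results of a single jump from
$z \to z \pm 1$, which occurs at rate $\sigma(\eta, \eta') = \frac{n^2}{2} \eta(z)$. In particular with

\[ F(\eta) = \frac{1}{n} \sum_{z} f\Big( \frac{z}{n} \Big) \eta(z) \]

we get

\[ (\mathcal{L}_n F)(\eta) = \frac{n}{2} \sum_{z} \eta(z) \Big[ f\Big( \frac{z+1}{n} \Big) + f\Big( \frac{z-1}{n} \Big) - 2 f\Big( \frac{z}{n} \Big) \Big] \simeq \frac{1}{2n} \sum_{z} \eta(z) f''\Big( \frac{z}{n} \Big) \simeq \frac{1}{2} \int f''(x) \rho(t, x)\,dx, \]

which, with just a little bit of work, leads to the heat equation (2.4).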
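The step from the jump dynamics to the second derivative rests on the fact that the symmetric second difference at spacing $1/n$, multiplied by the speeded-up rate $n^2/2$, approximates $\frac{1}{2} f''$. A minimal numerical check, with a test function and sample point chosen by us:

```python
import math

def rescaled_second_difference(f, z, n):
    """(n^2 / 2) [f((z+1)/n) + f((z-1)/n) - 2 f(z/n)], the per-particle generator term."""
    return 0.5 * n * n * (f((z + 1) / n) + f((z - 1) / n) - 2 * f(z / n))

n = 1000
z = 300                      # microscopic site; macroscopic location x = z/n = 0.3
approx = rescaled_second_difference(math.sin, z, n)
exact = -0.5 * math.sin(z / n)   # (1/2) f''(x) for f = sin
```

The two numbers agree up to an error of order $1/n^2$, which is why only $\frac{1}{2} f''$ survives in the limit.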


11.3. Simple Exclusion Processes 


Let us now introduce some interaction. The simple exclusion rule imposes 
the limit of one particle per site. If a particle decides to jump and the site 
it chooses to jump to is occupied, the jump can not be completed, and the 
particle waits instead for the next Poisson time. Let us look at the simple 
case of totally asymmetric walk where the only possible jump is to the next
site to the right. The generator is given by

\[ (\mathcal{L} F)(\eta) = \sum_{z} \eta(z) (1 - \eta(z + 1)) \big[ F(\eta^{z, z+1}) - F(\eta) \big]. \]

For any configuration $\eta = \{\eta(z)\}$, the new configuration $\eta^{z, z'} =
\{\eta^{z, z'}(z'')\}$ is defined by

\[ \eta^{z, z'}(z'') = \begin{cases} \eta(z') & \text{if } z'' = z \\ \eta(z) & \text{if } z'' = z' \\ \eta(z'') & \text{otherwise.} \end{cases} \]


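Note that because of exclusion, the number of particles $\eta(z)$ at $z$ is either
0 or 1. Now we speed up time by $n$ and rescale space by $x = \frac{z}{n}$. With

\[ F(\eta) = \frac{1}{n} \sum_{z} f\Big( \frac{z}{n} \Big) \eta(z), \]

\[ (\mathcal{L}_n F)(\eta) \simeq \frac{1}{n} \sum_{z} \eta(z) (1 - \eta(z + 1)) f'\Big( \frac{z}{n} \Big). \tag{3.1} \]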

The invariant distributions in this case are Bernoulli product measures, with
density $\rho$. That plays a role because we need to average $\eta(z)(1 - \eta(z + 1))$.
While averages of $\eta(z)$ over long blocks will only change slowly due to
the conservation of the number of particles, quantities like $\eta(z)(1 - \eta(z + 1))$
fluctuate rapidly and their averages are computed under the relevant
invariant distribution. The average of $\eta(z)(1 - \eta(z + 1))$ is $\rho(1 - \rho)$. Equation
(3.1) now leads to

\[ \frac{d}{dt} \int f(x) \rho(t, x)\,dx = \int f'(x) \rho(t, x) (1 - \rho(t, x))\,dx \]

which is the weak form of Burgers equation

\[ \frac{\partial \rho}{\partial t} + \frac{\partial}{\partial x} [\rho(t, x) (1 - \rho(t, x))] = 0. \tag{3.2} \]


Weak solutions of this equation are not unique, but the limit can be characterized
as the unique solution that satisfies certain "entropy conditions".
We have used the independence provided by the Bernoulli distribution.
What is really needed is that averages of the type $\frac{1}{2\ell + 1} \sum_{|z| \le \ell} \eta(z)(1 -
\eta(z + 1))$ over long blocks can be determined with high probability as being
nearly equal to $\bar{\eta}(1 - \bar{\eta})$ with $\bar{\eta} = \frac{1}{2\ell + 1} \sum_{|z| \le \ell} \eta(z)$. See [3] for a discussion
of the model and additional references.
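A finite analogue of these statements can be checked exactly. The sketch below puts the totally asymmetric process on a ring of 12 sites with 4 particles, a setting chosen by us for illustration (the text works on the infinite lattice). On the ring, the Bernoulli measure conditioned on the particle number is the uniform measure on configurations; the code verifies that it is stationary by comparing, for every configuration, the number of jumps leaving it with the number of jumps entering it, and then computes the mean of $\eta(z)(1 - \eta(z + 1))$.

```python
from fractions import Fraction
from itertools import combinations

def ring_configurations(num_sites, num_particles):
    """All 0/1 configurations on the ring with the given number of particles."""
    for occupied in combinations(range(num_sites), num_particles):
        eta = [0] * num_sites
        for z in occupied:
            eta[z] = 1
        yield tuple(eta)

def jump_count(eta):
    """Number of sites z with eta(z)(1 - eta(z+1)) = 1: allowed jumps out of eta."""
    n = len(eta)
    return sum(eta[z] * (1 - eta[(z + 1) % n]) for z in range(n))

def entry_count(eta):
    """Number of configurations reaching eta in one jump: a hole at z and a particle at z+1."""
    n = len(eta)
    return sum((1 - eta[z]) * eta[(z + 1) % n] for z in range(n))

num_sites, num_particles = 12, 4
configs = list(ring_configurations(num_sites, num_particles))

# Stationarity of the uniform measure: flow out equals flow in at every configuration.
balanced = all(jump_count(eta) == entry_count(eta) for eta in configs)

# Mean of eta(z)(1 - eta(z+1)) under the uniform measure, averaged over sites.
mean_flux = Fraction(sum(jump_count(eta) for eta in configs), num_sites * len(configs))
density = Fraction(num_particles, num_sites)
```

The mean comes out as $8/33$, which is $\rho(1 - \rho)$ with $\rho = 1/3$ multiplied by the finite-size factor $12/11$; the correction disappears as the ring grows.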

We will now consider a general class of simple exclusion processes on
$\mathbb{Z}^d$, where the jump rate from $z$ to $z'$ is $p(z' - z)$. The site still needs to be
free if a jump is to be executed. The generator is given by

\[ (\mathcal{L} F)(\eta) = \sum_{z, z'} \eta(z) (1 - \eta(z')) p(z' - z) \big[ F(\eta^{z, z'}) - F(\eta) \big]. \]

If the mean $\sum_z z\, p(z) \ne 0$, then the situation is not all that different from
the one dimensional case discussed earlier. We shall now assume that
$\sum_z z\, p(z) = 0$. There are two cases. If $p(z) = p(-z)$, then we can symmetrize
and rewrite

\[ (\mathcal{L} F)(\eta) = \frac{1}{2} \sum_{z, z'} \big[ \eta(z)(1 - \eta(z')) + \eta(z')(1 - \eta(z)) \big] p(z' - z) \big[ F(\eta^{z, z'}) - F(\eta) \big] \]

\[ = \frac{1}{2} \sum_{z, z'} p(z' - z) \big[ F(\eta^{z, z'}) - F(\eta) \big] \]

because $\eta(z)(1 - \eta(z')) + \eta(z')(1 - \eta(z)) = 1$ if $\eta(z) \ne \eta(z')$. Otherwise
$F(\eta^{z, z'}) = F(\eta)$.

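In this case, time is speeded up by a factor of $n^2$ and one can calculate
for

\[ F(\eta) = n^{-d} \sum_{z} f\Big( \frac{z}{n} \Big) \eta(z), \]

\[ (\mathcal{L}_n F)(\eta) = \frac{n^2}{2} \sum_{z, z'} p(z' - z) \big[ F(\eta^{z, z'}) - F(\eta) \big] \]

\[ = n^{2-d} \sum_{z} \eta(z) \sum_{z'} p(z' - z) \Big[ f\Big( \frac{z'}{n} \Big) - f\Big( \frac{z}{n} \Big) \Big] \]

\[ \simeq \frac{1}{2 n^d} \sum_{z} \eta(z) (A f)\Big( \frac{z}{n} \Big). \]

Here $A$ is the second order differential operator $\sum_{i,j} a_{i,j} D_i D_j$, and
$\{a_{i,j}\}$ is the covariance matrix

\[ a_{i,j} = \sum_{z} \langle z, e_i \rangle \langle z, e_j \rangle p(z). \]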
It is not very hard to deduce that if initially one has the convergence of the
empirical distributions

\[ \nu_n = \frac{1}{n^d} \sum_{z} \delta_{\frac{z}{n}} \eta(z) \]

in the weak topology, to a deterministic limit

\[ \nu(dx) = \rho_0(x)\,dx \]

then the empirical distribution

\[ \nu_n(t) = \frac{1}{n^d} \sum_{z} \delta_{\frac{z}{n}} \eta(t, z) \]

in the speeded up time scale (i.e. $t$ is really $n^2 t$) will converge in probability
to $\rho(t, x)\,dx$, where $\rho(t, x)$ solves the heat equation

\[ \frac{\partial \rho}{\partial t} = \frac{1}{2} \sum_{i,j} a_{i,j} D_i D_j \rho, \qquad \rho(0, x) = \rho_0(x). \tag{3.3} \]


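If we drop the assumption of symmetry and only assume that $\sum_z z\,p(z) = 0$,
then we have a serious problem. The summation

$$(\mathcal{L}_n F)(\eta) = \frac{n^2}{n^d}\sum_{z,z'} \eta(z)(1-\eta(z'))\,p(z'-z)\Big[f\Big(\frac{z'}{n}\Big) - f\Big(\frac{z}{n}\Big)\Big]$$

does not simplify. The smoothness of $f$ can be used to get rid of one power
of $n$ by replacing $f(\frac{z'}{n}) - f(\frac{z}{n})$ by $\frac{1}{2n}\langle (z'-z), (\nabla f)(\frac{z}{n}) + (\nabla f)(\frac{z'}{n})\rangle$. If we
denote by

$$W(z) = \eta(z)\sum_{z'}(1-\eta(z'))\,(z'-z)\,p(z'-z) + (1-\eta(z))\sum_{z'}\eta(z')\,(z-z')\,p(z-z')$$

then we have

$$(\mathcal{L}_n F)(\eta) \simeq \frac{n}{2n^d}\sum_z \Big\langle W(z), (\nabla f)\Big(\frac{z}{n}\Big)\Big\rangle.$$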


We can write $W(z)$ as $\tau_z W(0)$ where $\tau_z$ is translation by $z$ in $\mathbb{Z}^d$. $W(0)$ has
mean 0 with respect to every Bernoulli measure. In some models the mean
0 function $W(0)$ takes a special form such as $\sum_z c_z[\tau_z G(\eta) - G(\eta)]$ for some
local function $G$. This allows for another summation by parts which gets
rid of one more power of $n$. We end up with an equation of the form

$$\frac{\partial}{\partial t}\int f(x)\rho(t,x)\,dx = \frac{1}{2}\int \sum_{i,j} a_{i,j}(D_iD_j f)(x)\,\hat G(\rho(t,x))\,dx$$

where $\hat G(\rho)$ is the expectation of $G(\eta)$ with respect to the invariant
(Bernoulli) measure with density $\rho$. In our case, i.e., symmetric simple
exclusion, $G(\eta)$ takes the very simple form $G(\eta) = \eta(0)$ and $\hat G(\rho) = \rho$.
This leads to an equation of the form

$$\frac{\partial\rho}{\partial t} = \frac{1}{2}\sum_{i,j} a_{i,j}D_iD_j\hat G(\rho(t,x)) = \frac{1}{2}\sum_{i,j} a_{i,j}D_iD_j\rho.$$

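Models of this type, where one can do summation by parts twice, are
referred to as gradient models. Another example of a gradient model is the
zero range process. Here there can be many particles at a site. The sites are
again some $\mathbb{Z}^d$. Each particle, if it jumps, jumps from $z$ to $z'$ with
probability $p(z'-z)$. However, unlike the Poisson case we saw earlier, the
jump rate $\lambda(t,z)$ for any particle at site $z$ depends on the number $\eta(t,z)$ of
particles currently at site $z$. Here

$$(\mathcal{L}F)(\eta) = \sum_{z,z'} \lambda(\eta(z))\,p(z'-z)\,[F(\eta^{z,z'}) - F(\eta)]$$

where $\eta^{z,z'}$ is obtained by moving a particle from $z$ to $z'$ in the configuration
$\eta$. With the usual diffusive scaling, if

$$F(\eta) = \frac{1}{n^d}\sum_z f\Big(\frac{z}{n}\Big)\eta(z)$$

then

$$(\mathcal{L}_n F)(\eta) = n^2\sum_{z,z'} \lambda(\eta(z))\,p(z'-z)\,[F(\eta^{z,z'}) - F(\eta)]$$

$$= \frac{n^2}{n^d}\sum_{z,z'} \lambda(\eta(z))\,p(z'-z)\Big[f\Big(\frac{z'}{n}\Big) - f\Big(\frac{z}{n}\Big)\Big]$$

$$\simeq \frac{1}{2n^d}\sum_{z,z'} \lambda(\eta(z))\,p(z'-z)\sum_{i,j}(z'-z)_i(z'-z)_j\,(D_iD_j f)\Big(\frac{z}{n}\Big)$$

$$= \frac{1}{n^d}\sum_z \lambda(\eta(z))\,(Af)\Big(\frac{z}{n}\Big).$$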


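There is a one parameter family of invariant distributions which are product
measures with the number $k$ of particles at any site having a distribution
$\nu_\theta$ given by

$$\nu_\theta(k) = \frac{1}{Z(\theta)}\,\frac{\theta^k}{\lambda(1)\lambda(2)\cdots\lambda(k)}$$

where $Z(\theta)$ is the normalization constant and $\theta = \theta(\rho)$ is adjusted so that
$\sum_k k\,\nu_\theta(k) = \rho$. The expectation of $\lambda(k)$ at density $\rho$, i.e.

$$\sum_k \lambda(k)\,\nu_{\theta(\rho)}(k) = \hat\lambda(\rho)$$

can be computed and

$$(\mathcal{L}_n F)(\eta) \simeq \int \hat\lambda(\rho(x))\,(Af)(x)\,dx$$

leading to the nonlinear PDE,

$$\frac{\partial\rho}{\partial t} = A\hat\lambda(\rho(t,x))$$

with

$$Af = \frac{1}{2}\sum_{i,j} a_{i,j}D_iD_j f.$$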


See [3] for a detailed exposition of zero-range processes. 
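The one parameter family of invariant measures can be explored numerically. The sketch below assumes the standard product form $\nu_\theta(k) \propto \theta^k/(\lambda(1)\cdots\lambda(k))$ for the zero range process, truncates the state space at a hypothetical cutoff `k_max`, and evaluates the density and the mean jump rate. Under that assumption the mean rate equals $\theta$ itself, so $\hat\lambda(\rho) = \theta(\rho)$; the two rate functions used as examples are illustrative.

```python
def zero_range_measure(rate, theta, k_max=400):
    """Return truncated weights nu_theta(k), k = 0..k_max.

    Assumes nu_theta(k) = theta^k / (Z(theta) * rate(1) * ... * rate(k)).
    """
    weights = [1.0]
    for k in range(1, k_max + 1):
        weights.append(weights[-1] * theta / rate(k))
    z_theta = sum(weights)  # truncated normalization constant Z(theta)
    return [w / z_theta for w in weights]


def density(nu):
    """Mean number of particles per site, sum_k k nu(k)."""
    return sum(k * w for k, w in enumerate(nu))


def mean_rate(rate, nu):
    """Expected jump rate sum_k rate(k) nu(k), with rate(0) taken to be 0."""
    return sum(rate(k) * w for k, w in enumerate(nu) if k > 0)


# Independent particles: rate(k) = k gives a Poisson law with mean theta.
nu_poisson = zero_range_measure(lambda k: float(k), 1.5)
# Constant rate: rate(k) = 1 for k >= 1 gives a geometric law.
nu_geometric = zero_range_measure(lambda k: 1.0, 0.5)
```

In the first example the density equals $\theta$ and the limiting equation is linear; in the second the density is $\theta/(1-\theta)$, so that $\hat\lambda(\rho) = \rho/(1+\rho)$ and the equation is genuinely nonlinear.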


With limit theorems of this kind we can answer questions of a certain
type: if the initial mass distribution is given by $\rho_0(x)$, how much of the
mass will be in a certain set $B$ at a later time $t$? But we cannot answer
questions of the form: how much of the mass that was in a certain set $A$ at
time 0 ended up in a set $B$ at a future time $t$? This requires us to keep
track of the identity of the particles. While the particles that were not in
$A$ at time 0 do not count, they do affect the motion of particles starting
from $A$.

A natural question to ask is the following: if we start in equilibrium with density $\rho$,
with one particle at 0, and watch that particle, what will its motion
be? We can expect diffusive behavior, especially in the symmetric simple
exclusion model. The so-called tagged particle will exhibit a Brownian
motion under the standard central limit theorem scaling, and the dispersion
matrix will in general be a function $S(\rho)$ of $\rho$. If $\rho$ is very small, there are
very few particles to affect the motion of the tagged particle and one would
expect $S(\rho) \simeq A = \{a_{i,j}\}$. However, if $\rho \simeq 1$, most sites are filled with
particles and free sites to jump to are hard to find. The tagged particle
will hardly move and one would expect $S(\rho) \simeq 0$. The motion of two
distinct tagged particles can be shown to be asymptotically independent.
In equilibrium at density $\rho$, the question we asked earlier can be
answered by

$$\rho\int_A dx\int_B p_\rho(t,x,y)\,dy$$

where $p_\rho$ is the transition probability density for the motion of the tagged
particle at density $\rho$. This suggests that if we take $A$ to be all of $\mathbb{R}^d$, then
the density evolves according to the heat equation with dispersion $S(\rho)$
and not $A$. But this is no contradiction, since we are in equilibrium and
constants are solutions of every heat equation! The scaling limit of the
tagged particle can be found in [5].


It is natural to ask what the motion of a tagged particle would be in
non-equilibrium. The particle only interacts with other particles in the
immediate neighborhood and since the full system is supposed to be locally
in equilibrium, the tagged particle at time $t$ and position $x$ will behave
almost like a particle in equilibrium with density $\rho(t,x)$, i.e., a Brownian
motion with dispersion $S(\rho(t,x))$. In other words the tagged particle will
behave like a diffusion with Kolmogorov backward generator

$$\mathcal{L} = \frac{1}{2}\sum_{i,j} S_{i,j}(\rho(t,x))\,D_iD_j.$$

Maybe so, but perhaps there is an additional first order term and the
generator is

$$\mathcal{L} = \frac{1}{2}\sum_{i,j} S_{i,j}(\rho(t,x))\,D_iD_j + \sum_j b_j(t,x)\,D_j.$$

Since we do not know $b$, it is more convenient to write the generator in
divergence form

$$\mathcal{L} = \frac{1}{2}\nabla\cdot S(\rho(t,x))\nabla + b(t,x)\cdot\nabla.$$


If this were to describe the motion of a tagged particle then the density $r$
of the tagged particle would evolve according to the forward Kolmogorov
equation, i.e.

$$\frac{\partial r}{\partial t} = \frac{1}{2}\nabla\cdot S(\rho(t,x))\nabla r - \nabla\cdot(br).$$

But the motion of the particles is the same, tagged or otherwise. Hence $\rho$
itself must satisfy the equation

$$\frac{\partial\rho}{\partial t} = \frac{1}{2}\nabla\cdot S(\rho(t,x))\nabla\rho - \nabla\cdot(b\rho) = \frac{1}{2}\nabla\cdot A\nabla\rho.$$
Crossing our fingers, and undoing a $\nabla$,

$$b(t,x) = \frac{[S(\rho(t,x)) - A]\,(\nabla\rho)(t,x)}{2\rho(t,x)}.$$
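The heuristic step of "undoing a $\nabla$" can be spelled out as follows. This is only a formal sketch: equality of two divergences does not by itself force equality of the fluxes, which is exactly why the relationship needs a separate proof.

```latex
\begin{align*}
% rho solves both the forward equation of the tagged particle
% and the heat equation with matrix A:
\nabla\cdot\Big[\tfrac12 S(\rho)\nabla\rho - b\,\rho\Big]
  &= \nabla\cdot\Big[\tfrac12 A\nabla\rho\Big], \\
% formally equate the fluxes inside the divergence:
\tfrac12 S(\rho)\nabla\rho - b\,\rho &= \tfrac12 A\nabla\rho, \\
% and solve for b:
b &= \frac{[S(\rho) - A]\,\nabla\rho}{2\rho}.
\end{align*}
```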
It is hard to prove this relationship directly. Instead we look at a system
where we have particles of $k$ different colors, say $j = 1,2,\ldots,k$. The evolution
is color blind and is the symmetric simple exclusion, but we keep
track of the colors. We try to derive equations for the evolution of the $k$
densities $\rho = \{\rho_j(t,x)\}$. We let $\eta_j(z) = 1$ if the site $z$ has a particle with
color $j$, so that $\eta(z) = \sum_j \eta_j(z)$. We denote by $\zeta$ the entire configuration $\{\eta_j(z)\}$.
With

$$F(\zeta) = \frac{1}{n^d}\sum_z f_j\Big(\frac{z}{n}\Big)\eta_j(z)$$

we can compute

$$\mathcal{L}_n F = \frac{n^2}{n^d}\sum_{z,z'} \eta_j(z)(1-\eta(z'))\,p(z'-z)\Big[f_j\Big(\frac{z'}{n}\Big) - f_j\Big(\frac{z}{n}\Big)\Big]$$

which is a non-gradient system. We can do one summation by parts and
we will be left to handle expressions like

$$\frac{n}{n^d}\sum_z\sum_r (D_r f_j)\Big(\frac{z}{n}\Big)\,W_{j,r}(\tau_z\zeta)$$

where $W_{j,r}(\zeta)$ is an expression with mean 0 under every invariant measure.
The invariant measures are product Bernoulli with $P[\eta_j(z) = 1] = \rho_j$. We
wish to replace the term $W_{j,r}$ by

$$\widehat W_{j,r} = \sum_{j',r'} C^{j',r'}_{j,r}\,[\eta_{j'}(e_{r'}) - \eta_{j'}(0)]$$

and show that the difference is negligible. This is an important and difficult
step in the analysis of non-gradient models. The negligibility is proved in
equilibrium, after averaging in space and time. In other words quantities of
the form

$$\frac{n}{n^d}\int_0^t \sum_z f\Big(\frac{z}{n}\Big)\big[W_{j,r}(\tau_z\zeta(s)) - \widehat W_{j,r}(\tau_z\zeta(s))\big]\,ds$$

are shown to be negligible in equilibrium. The quantities $\{C^{j',r'}_{j,r}\}$, which
exist, can be explicitly calculated as functions of $\rho = \{\rho_j\}$.


$$B_{j,j'}(\rho) = \Big(\delta_{j,j'}\,\rho_j - \frac{\rho_j\rho_{j'}}{\rho}\Big)S(\rho) + \frac{\rho_j\rho_{j'}(1-\rho)}{\rho}\,A,$$

$$\chi^{-1}_{j,j'}(\rho) = \frac{\delta_{j,j'}}{\rho_j} + \frac{1}{1-\rho}$$

and

$$C = B\chi^{-1}.$$
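A numerical sanity check of the coefficient matrix is sketched below in one space dimension, so that $S(\rho)$ and $A$ are scalars. It assumes the forms $B_{j,j'} = (\delta_{j,j'}\rho_j - \rho_j\rho_{j'}/\rho)S + (\rho_j\rho_{j'}(1-\rho)/\rho)A$ and $\chi^{-1}_{j,j'} = \delta_{j,j'}/\rho_j + 1/(1-\rho)$, and checks that $C = B\chi^{-1}$ reduces to $\delta_{j,j'}S + (\rho_j/\rho)(A - S)$. The numerical values of the densities, of $S$ and of $A$ are hypothetical placeholders, since $S(\rho)$ is not known in closed form.

```python
def conductivity(rho, s, a):
    """Assumed form of B_{j,j'} for scalar S = s and A = a (one space dimension)."""
    total = sum(rho)
    k = len(rho)
    return [[((rho[j] if j == jp else 0.0) - rho[j] * rho[jp] / total) * s
             + rho[j] * rho[jp] * (1.0 - total) / total * a
             for jp in range(k)] for j in range(k)]


def chi_inverse(rho):
    """Inverse of chi_{j,j'} = delta_{j,j'} rho_j - rho_j rho_j' (product Bernoulli)."""
    total = sum(rho)
    k = len(rho)
    return [[(1.0 / rho[j] if j == jp else 0.0) + 1.0 / (1.0 - total)
             for jp in range(k)] for j in range(k)]


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][m] * y[m][j] for m in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


rho_colors = [0.2, 0.3, 0.1]      # hypothetical color densities, total 0.6
s_value, a_value = 0.4, 1.0       # hypothetical values of S(rho) and A
C = matmul(conductivity(rho_colors, s_value, a_value), chi_inverse(rho_colors))
```

Each column of `C` sums to `a_value`, which reflects the fact that the total density diffuses with the matrix $A$ regardless of $S$.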


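One has to justify substituting for $\rho$ their local empirical values in
non-equilibrium. Finally for the densities $\{\rho_j(t,x)\}$ we get a system of coupled
partial differential equations

$$\frac{\partial}{\partial t}\int f_j(x)\rho_j(t,x)\,dx = -\frac{1}{2}\int \sum_{r,j',r'} (D_r f_j)(x)\,C^{j',r'}_{j,r}(\rho(t,x))\,(D_{r'}\rho_{j'})(t,x)\,dx$$

or

$$\frac{\partial\rho_j}{\partial t} = \frac{1}{2}\nabla\cdot\sum_{j'} C_{j,j'}(\rho)\nabla\rho_{j'}.$$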


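The sum $\rho = \sum_j \rho_j$ will satisfy the equation

$$\frac{\partial\rho}{\partial t} = \frac{1}{2}\nabla\cdot A\nabla\rho$$

and given $\rho(t,x)$, each $\rho_j(t,x)$ is seen to be a solution of

$$\frac{\partial\rho_j(t,x)}{\partial t} = \frac{1}{2}\nabla\cdot S(\rho(t,x))\nabla\rho_j(t,x) - \nabla\cdot\Big[\rho_j(t,x)\,\frac{(S(\rho(t,x)) - A)\nabla\rho(t,x)}{2\rho(t,x)}\Big]$$

which is the forward equation for the tagged particle motion. These results
were first obtained by Quastel ([7]). It requires more work to conclude that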


the empirical process

$$R_n = \frac{1}{n^d}\sum_j \delta_{x_j(\cdot)},$$

with $x_j(\cdot)$ the rescaled trajectory of the $j$-th particle,
viewed as a random measure on the space $\Omega = D[[0,T]; \mathbb{R}^d]$ of trajectories,
converges in probability, in the topology of weak convergence of measures
on $\Omega$, to the measure $Q_0$, which is a Markov process on $\mathbb{R}^d$ with backward
generator

$$\frac{1}{2}\nabla\cdot S(\rho(t,x))\nabla + \frac{(S(\rho(t,x)) - A)\nabla\rho(t,x)}{2\rho(t,x)}\cdot\nabla$$

and initial distribution $\rho_0(x)$. Here $\rho(t,x)$ itself is the solution of the heat
equation (3.3). The proof involves going through the multicolor system, where
the colors code past history and the number of colors increases as the coding
gets refined. This was carried out by Rezakhanlou in [9].
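As a consistency check on the forward equation for the colored densities $\rho_j$, one can sum it over $j$. The sketch below assumes that the colors exhaust all particles, so that $\sum_j \rho_j = \rho$; the drift terms then combine so that $S(\rho)$ cancels and the heat equation with matrix $A$ is recovered.

```latex
\begin{align*}
\sum_j \frac{\partial\rho_j}{\partial t}
 &= \frac12\nabla\cdot S(\rho)\nabla\Big(\sum_j\rho_j\Big)
    - \nabla\cdot\Big[\Big(\sum_j\rho_j\Big)
      \frac{(S(\rho)-A)\nabla\rho}{2\rho}\Big] \\
 &= \frac12\nabla\cdot S(\rho)\nabla\rho
    - \frac12\nabla\cdot\big(S(\rho)-A\big)\nabla\rho \\
 &= \frac12\nabla\cdot A\nabla\rho .
\end{align*}
```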
11.4. Large Deviations


We will now move on to discuss the issues of large deviations. Large deviations
arise by changing the rules of the evolution. The amount of change is
measured in terms of relative entropy, suitably normalized. The effect is to
produce a different limit. The large deviation rate function is the minimum
of the relative entropy over all the changes that produce the desired limit.
Our goal is to determine the rate function $H(Q)$ for the event that the empirical process
is close to $Q$. We will carry it out in three steps. The process $Q$ has a
marginal $q(t,x)$. The first step is to determine the rate function for the
initial configuration $q(0,x)$. This is some quantity $I(q(0,\cdot))$. If the initial
condition is chosen randomly with a site $z$ getting a particle with probability
$\rho(\frac{z}{n})$, then the typical profile will be $\rho(x)$ but we can have any profile
$q(x)$ with a large deviation rate function equal to

$$I(q) = \int\Big[q(x)\log\frac{q(x)}{\rho(x)} + (1-q(x))\log\frac{1-q(x)}{1-\rho(x)}\Big]dx.$$

Although we state the results for the full $\mathbb{Z}^d$ which scales to $\mathbb{R}^d$, the results
are often proved for a large periodic lattice that scales to the torus. But
we will ignore this fine point.
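The initial rate function is a relative entropy of Bernoulli profiles and is simple to evaluate numerically. The following sketch assumes $d = 1$, the unit interval as macroscopic space, and a reference profile with $0 < \rho(x) < 1$; the midpoint rule and the function names are illustrative choices.

```python
import math


def bernoulli_entropy(q, rho):
    """Integrand q log(q/rho) + (1-q) log((1-q)/(1-rho)), with 0 log 0 = 0."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / rho)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - rho))
    return total


def initial_rate_function(q, rho, n_cells=1000):
    """Midpoint-rule approximation of I(q) on the unit interval (d = 1)."""
    h = 1.0 / n_cells
    return sum(bernoulli_entropy(q((i + 0.5) * h), rho((i + 0.5) * h))
               for i in range(n_cells)) * h


def flat(x):
    """Hypothetical typical profile rho(x) = 1/2."""
    return 0.5


def bumpy(x):
    """Hypothetical deviant profile q(x) oscillating around 1/2."""
    return 0.5 + 0.25 * math.sin(2.0 * math.pi * x)


cost_typical = initial_rate_function(flat, flat)    # the typical profile costs nothing
cost_deviant = initial_rate_function(bumpy, flat)   # any other profile has positive cost
```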

The next step is to examine how the density of the untagged system
evolves. The unperturbed limit, as we saw, was the solution of the heat
equation (3.3). The underlying system is governed by Poisson jump processes
$z \to z'$ with rate $p(z'-z)\eta(z)(1-\eta(z'))$. We can perturb this to
$p(z'-z) + q_n(t,z,z')$, introducing a spatial and temporal variation. Denoting
by $q$ the strength of the perturbation, the entropy cost is the entropy
of one Poisson process with respect to another, which is seen to be of order
$q^2$. The number of sites is of the order of $n^d$ and time is of order $n^2$. If we
want the relative entropy to be of the order of magnitude $n^d$, then $q$ will
have to be of order $\frac{1}{n}$. The asymmetry has to be "weak". We introduce
the perturbed operator

$$(\widetilde{\mathcal{L}}_n F)(\eta) = n^2\sum_{z,z'}\Big[p(z'-z) + \frac{1}{n}\,q\Big(t,\frac{z}{n},z'-z\Big)\Big]\eta(z)(1-\eta(z'))\,[F(\eta^{z,z'}) - F(\eta)].$$

Here $q(t,x,z)$ determines the perturbation, and $b(t,x) = \sum_{z'} z'\,q(t,x,z')$ is the
local bias. It is not hard to prove that we do have again a scaling limit but
the equation is different:

$$\frac{\partial\rho(t,x)}{\partial t} = \frac{1}{2}\nabla\cdot A\nabla\rho(t,x) - \nabla\cdot[b(t,x)\rho(t,x)(1-\rho(t,x))]. \qquad (4.1)$$
The interaction shows up in the nonlinearity and the term $(1-\rho(t,x))$ is
the effect of exclusion. The entropy cost normalized by $n^{-d}$ is seen to be
asymptotically

$$\frac{1}{2}\int\!\!\int\sum_z \frac{q(t,x,z)^2}{p(z)}\,\rho(t,x)(1-\rho(t,x))\,dx\,dt.$$
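The origin of the factor $\frac{1}{2}q^2/p$ can be seen from the relative entropy of two Poisson processes. The sketch below uses the standard formula for the entropy per unit time of a Poisson process of rate $\lambda + \varepsilon$ with respect to one of rate $\lambda$; the symbols $\lambda$ and $\varepsilon$ are introduced here for illustration, with $\lambda = p(z)$ and $\varepsilon = q/n$ for a bond on which the jump is allowed.

```latex
\begin{align*}
% entropy per unit time of rate (lambda + eps) relative to rate lambda
h(\lambda+\varepsilon \,;\, \lambda)
  &= (\lambda+\varepsilon)\log\frac{\lambda+\varepsilon}{\lambda} - \varepsilon
   = \frac{\varepsilon^2}{2\lambda} + O(\varepsilon^3), \\
% with lambda = p(z), eps = q/n, over microscopic time n^2 t
n^2\cdot\frac{1}{2\,p(z)}\Big(\frac{q}{n}\Big)^2
  &= \frac{q^2}{2\,p(z)} .
\end{align*}
```

Summing over the order $n^d$ sites, with the factor $\eta(z)(1-\eta(z'))$ averaging to $\rho(1-\rho)$ in local equilibrium, gives a total cost of order $n^d$, consistent with the normalization above.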


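We need to minimize $\sum_z \frac{q(z)^2}{p(z)}$ over $q$ fixing $b = \sum_z z\,q(z)$. This is seen to
equal $\langle b, A^{-1}b\rangle$. The minimal cost of producing a $b(t,x)$ is therefore

$$J(b) = \frac{1}{2}\int\!\!\int \langle b(t,x), A^{-1}b(t,x)\rangle\,\rho(t,x)(1-\rho(t,x))\,dx\,dt.$$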
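The constrained minimization can be checked numerically. The sketch below works in one dimension, where $A = \sum_z z^2 p(z)$ is a scalar and a Lagrange multiplier argument gives the minimizer $q(z) = p(z)\,z\,b/A$ with cost $b^2/A$. The jump law, the bias value and the competing perturbation are hypothetical examples chosen so that $p$ has mean zero without being symmetric.

```python
def optimal_perturbation(p, b):
    """Minimiser of sum_z q(z)^2 / p(z) subject to sum_z z q(z) = b, for d = 1.

    Returns the minimiser q(z) = p(z) z b / A together with A = sum_z z^2 p(z).
    """
    a = sum(z * z * prob for z, prob in p.items())
    q = {z: prob * z * b / a for z, prob in p.items()}
    return q, a


def cost(q, p):
    """Quadratic cost sum_z q(z)^2 / p(z)."""
    return sum(q[z] ** 2 / p[z] for z in p)


def bias(q):
    """Local bias sum_z z q(z)."""
    return sum(z * qz for z, qz in q.items())


# Hypothetical asymmetric jump law with mean zero: -0.4 - 0.2 + 0.6 = 0.
p_example = {-2: 0.2, -1: 0.2, 1: 0.6}
b_example = 0.7
q_star, a_example = optimal_perturbation(p_example, b_example)

# A competitor with the same bias: r satisfies sum_z z r(z) = -2 + 1 + 1 = 0.
r = {-2: 1.0, -1: -1.0, 1: 1.0}
q_other = {z: q_star[z] + 0.1 * r[z] for z in p_example}
```

Any competitor with the same bias, such as `q_other`, has a strictly larger cost because the cross term vanishes at the minimizer.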


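If we are only interested in producing a density profile $\rho$ then we need to
minimize $J(b)$ over $B(\rho(\cdot))$, i.e. all $b$ such that (4.1) holds. This can be done
and the answer, as worked out in [4], is

$$I(\rho) = \inf_{b\in B(\rho(\cdot))} J(b)$$

$$= \sup_F\Big[\int\!\!\int F(t,x)\Big[\frac{\partial\rho}{\partial t} - \frac{1}{2}\nabla\cdot A\nabla\rho(t,x)\Big]dx\,dt$$

$$\qquad - \frac{1}{2}\int\!\!\int\langle\nabla F(t,x), A\nabla F(t,x)\rangle\,\rho(t,x)(1-\rho(t,x))\,dt\,dx\Big].$$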
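The link between the variational formula and $J(b)$ can be sketched formally. The computation below assumes that boundary terms vanish in the integration by parts (for instance on the torus) and uses only the perturbed equation (4.1) together with the definition of $J(b)$.

```latex
\begin{align*}
% for b in B(rho), use (4.1) and integrate by parts:
\int\!\!\int F\Big[\frac{\partial\rho}{\partial t}
   - \frac12\nabla\cdot A\nabla\rho\Big]dx\,dt
 &= -\int\!\!\int F\,\nabla\cdot[\,b\,\rho(1-\rho)]\,dx\,dt
  = \int\!\!\int \langle\nabla F, b\rangle\,\rho(1-\rho)\,dx\,dt, \\
% pointwise quadratic bound, with equality iff grad F = A^{-1} b:
\langle\nabla F, b\rangle - \frac12\langle\nabla F, A\nabla F\rangle
 &\le \frac12\langle b, A^{-1}b\rangle .
\end{align*}
```

Hence the supremum over $F$ is at most $J(b)$ for every admissible $b$, with equality when $A^{-1}b$ is a gradient.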


Now, at the third step, we turn to the question of large deviation of the
empirical process $R_n(d\omega)$. It will have a large deviation principle on the
space of measures on $\Omega$, with a rate function $H(Q)$. The process $Q$ will
have a marginal $q(t,x)$. If $H(Q)$ is to be finite then $I(q)$ has to be finite.
For any $b \in B(q(\cdot))$ with $J(b) < \infty$, there is a Markov process $Q_b$, which
is the motion of the tagged particle in the perturbed system. This process
has the backward generator

$$\frac{1}{2}\nabla\cdot S(q(t,x))\nabla + \frac{1}{2}\,(S(q(t,x)) - A)\frac{\nabla q(t,x)}{q(t,x)}\cdot\nabla + (1-q(t,x))\,b\cdot\nabla.$$

The process $Q$ must have finite relative entropy $H(Q;Q_b)$ with respect
to any $Q_b$, and therefore the stochastic integrals $\int_0^T\langle f(t,x(t)), dx(t)\rangle$ make
sense with respect to $Q$, and we pick $c \in B(q(\cdot))$ such that

$$E^{Q}\Big[\int_0^T \langle f(t,x(t)), dx(t)\rangle\Big] = E^{Q_c}\Big[\int_0^T \langle f(t,x(t)), dx(t)\rangle\Big].$$
The rate function is given by

$$H(Q) = I(\rho_0) + J(c) + H(Q;Q_c).$$

Details can be found in [8].


We finally conclude with some results concerning the large deviation
probabilities for the totally asymmetric simple exclusion process. We saw that
the scaling limit was given as weak solutions of (3.2) that satisfied an entropy
condition. The entropy condition can be stated as

$$\xi = \frac{\partial h(\rho)}{\partial t} + \frac{\partial g(\rho)}{\partial x} \le 0 \qquad (4.2)$$

in the sense of distributions. Here $h$ is a convex function and $g'(\rho) = h'(\rho)(1-2\rho)$.
Clearly for smooth solutions of (3.2), (4.2) will hold with equality. Of
special interest for large deviations is the convex function $h(\rho) = \rho\log\rho +
(1-\rho)\log(1-\rho)$. It turns out that the large deviation rate function is
finite only for weak solutions of (3.2) and is given by the total mass of the
positive part of the distribution $\xi$ given in (4.2). These results can be found
in [6], [10], and [11].
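For the special entropy $h(\rho) = \rho\log\rho + (1-\rho)\log(1-\rho)$ the flux $g$ can be written down explicitly. The sketch below assumes that (3.2), which is not restated in this section, is the conservation law with flux $\rho(1-\rho)$; this is consistent with the relation $g'(\rho) = h'(\rho)(1-2\rho)$. The additive constant in $g$ is immaterial.

```latex
\begin{align*}
h'(\rho) &= \log\frac{\rho}{1-\rho}, \\
g(\rho)  &= \rho(1-\rho)\log\frac{\rho}{1-\rho} - \rho, \\
g'(\rho) &= (1-2\rho)\log\frac{\rho}{1-\rho} + 1 - 1 = h'(\rho)(1-2\rho), \\
% for a smooth solution of rho_t + [rho(1-rho)]_x = 0:
\frac{\partial h(\rho)}{\partial t} + \frac{\partial g(\rho)}{\partial x}
         &= h'(\rho)\Big[\frac{\partial\rho}{\partial t}
            + (1-2\rho)\frac{\partial\rho}{\partial x}\Big] = 0 .
\end{align*}
```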